Computer Method and Apparatus for Processing Image Data

ABSTRACT

A data compression method and apparatus that includes detecting a portion of a signal comprising a sequence of video frames that uses a disproportionate amount of bandwidth compared to other portions of the signal. The detected portion of the signal results in determined components of interest. Relative to certain variance, these components of interest are normalized to generate an intermediate form, which represents the components of interest reduced in complexity by the certain variance and enables a compressed form of the signal that maintains saliency. The detecting includes any of: (i) analyzing image gradients across frames where image gradient is a first derivative model and gradient flow is a second derivative, (ii) integrating finite differences of pels temporally/spatially to form a derivative model, (iii) analyzing an illumination field across frames, and (iv) predictive analysis, to determine bandwidth consumption, which is used to determine the components of interest.

RELATED APPLICATIONS

This application is a divisional of U.S. application Ser. No. 12/522,322, filed Jul. 7, 2009, which is the U.S. National Stage of International Application No. PCT/US2008/000090, filed on Jan. 4, 2008, which designates the U.S., published in English, claims the benefit of U.S. Provisional Application No. 60/881,966, filed on Jan. 23, 2007. U.S. application Ser. No. 12/522,322 is also a continuation-in-part of U.S. application Ser. No. 11/396,010, filed Mar. 31, 2006, which is a continuation-in-part of U.S. application Ser. No. 11/336,366 filed Jan. 20, 2006, which is a continuation-in-part of U.S. application Ser. No. 11/280,625 filed Nov. 16, 2005 which is a continuation-in-part of U.S. application Ser. No. 11/230,686 filed Sep. 20, 2005 which is a continuation-in-part of U.S. application Ser. No. 11/191,562 filed Jul. 28, 2005, now U.S. Pat. No. 7,158,680. U.S. application Ser. No. 11/396,010 also claims priority to U.S. Provisional Application No. 60/667,532, filed Mar. 31, 2005 and U.S. Provisional Application No. 60/670,951, filed Apr. 13, 2005. This application is related to U.S. Provisional Application No. 60/811,890, filed Jun. 9, 2006. The entire teachings of the above applications are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention is generally related to the field of digital signal processing, and more particularly, to computer apparatus and computer-implemented methods for the efficient representation and processing of signal or image data, and most particularly, video data.

BACKGROUND OF THE INVENTION

The general system description of the prior art in which the current invention resides can be expressed as in FIG. 1. Here a block diagram displays the typical prior art video processing system. Such systems typically include the following stages: an input stage 102, a processing stage 104, an output stage 106 and one or more data storage mechanisms 108.

The input stage 102 may include elements such as camera sensors, camera sensor arrays, range finding sensors or a means of retrieving data from a storage mechanism. The input stage provides video data representing time correlated sequences of man made and/or naturally occurring phenomena. The salient component of the data may be masked or contaminated by noise or other unwanted signals.

The video data, in the form of a data stream, array or packet, may be presented to the processing stage 104 directly or through an intermediate storage element 108 in accordance with a predefined transfer protocol. The processing stage may take the form of dedicated analog or digital devices or programmable devices such as central processing units (CPUs), digital signal processors (DSPs) or field programmable gate arrays (FPGAs) to execute a desired set of video data processing operations. The processing stage 104 typically includes one or more CODECs (COder/DECoders).

Output stage 106 produces a signal, display or other response which is capable of affecting a user or external apparatus. Typically, an output device is employed to generate an indicator signal, a display, a hard copy, a representation of processed data in storage or to initiate transmission of data to a remote site. It may also be employed to provide an intermediate signal or control parameter for use in subsequent processing operations.

Storage is presented as an optional element in this system. When employed, storage element 108 may be either non-volatile, such as read-only storage media, or volatile, such as dynamic random access memory (RAM). It is not uncommon for a single video processing system to include several types of storage elements, with the elements having various relationships to the input, processing and output stages. Examples of such storage elements include input buffers, output buffers and processing caches.

The primary objective of the video processing system in FIG. 1 is to process input data to produce an output which is meaningful for a specific application. In order to accomplish this goal, a variety of processing operations may be utilized, including noise reduction or cancellation, feature extraction, object segmentation and/or normalization, data categorization, event detection, editing, data selection, data re-coding and transcoding.

Many data sources that produce poorly constrained data are of importance to people, especially sound and visual images. In most cases the essential characteristics of these source signals adversely impact the goal of efficient data processing. The intrinsic variability of the source data is an obstacle to processing the data in a reliable and efficient manner without introducing errors arising from naive empirical and heuristic methods used in deriving engineering assumptions. This variability is lessened for applications when the input data are naturally or deliberately constrained into narrowly defined characteristic sets (such as a limited set of symbol values or a narrow bandwidth). These constraints all too often result in processing techniques that are of low commercial value.

The design of a signal processing system is influenced by the intended use of the system and the expected characteristics of the source signal used as an input. In most cases, the performance efficiency required will also be a significant design factor. Performance efficiency, in turn, is affected by the amount of data to be processed compared with the data storage available as well as the computational complexity of the application compared with the computing power available.

Conventional video processing methods suffer from a number of inefficiencies which are manifested in the form of slow data communication speeds, large storage requirements and disturbing perceptual artifacts. These can be serious problems because of the variety of ways people desire to use and manipulate video data and because of the innate sensitivity people have for some forms of visual information.

An “optimal” video processing system is efficient, reliable and robust in performing a desired set of processing operations. Such operations may include the storage, transmission, display, compression, editing, encryption, enhancement, categorization, feature detection and recognition of the data. Secondary operations may include integration of such processed data with other information sources. Equally important, in the case of a video processing system, the outputs should be compatible with human vision by avoiding the introduction of perceptual artifacts.

A video processing system may be described as “robust” if its speed, efficiency and quality do not depend strongly on the specifics of any particular characteristics of the input data. Robustness also is related to the ability to perform operations when some of the input is erroneous. Many video processing systems fail to be robust enough to allow for general classes of applications, providing application only to the same narrowly constrained data that was used in the development of the system.

Salient information can be lost in the discretization of a continuous-valued data source due to the sampling rate of the input element not matching the signal characteristics of the sensed phenomena. Also, there is loss when the signal's strength exceeds the sensor's limits, resulting in saturation. Similarly, information is lost when the precision of input data is reduced as happens in any quantization process when the full range of values in the input data is represented by a set of discrete values, thereby reducing the precision of the data representation.

Ensemble variability refers to any unpredictability in a class of data or information sources. Data representative of visual information has a very large degree of ensemble variability because visual information is typically unconstrained. Visual data may represent any spatial array sequence or spatio-temporal sequence that can be formed by light incident on a sensor array.

In modeling visual phenomena, video processors generally impose some set of constraints and/or structure on the manner in which the data is represented or interpreted. As a result, such methods can introduce systematic errors which would impact the quality of the output, the confidence with which the output may be regarded and the type of subsequent processing tasks that can reliably be performed on the data.

Quantization methods reduce the precision of data in the video frames while attempting to retain the statistical variation of that data. Typically, the video data is analyzed such that the distributions of data values are collected into probability distributions. There are also methods that project the data into phase space in order to characterize the data as a mixture of spatial frequencies, thereby allowing precision reduction to be diffused in a less objectionable manner. When utilized heavily, these quantization methods often result in perceptually implausible colors and can induce abrupt pixelation in originally smooth areas of the video frame.

Differential coding is also typically used to capitalize on the local spatial similarity of data. Data in one part of the frame tend to be clustered around similar data in that frame, and also in a similar position in subsequent frames. Representing the data in terms of its spatially adjacent data can then be combined with quantization and the net result is that, for a given precision, representing the differences is more accurate than using the absolute values of the data. This assumption works well when the spectral resolution of the original video data is limited, such as in black and white video, or low-color video. As the spectral resolution of the video increases, the assumption of similarity breaks down significantly. The breakdown is due to the inability to selectively preserve the precision of the video data.
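By way of non-limiting illustration, the coding of spatially adjacent differences described above can be sketched as follows. The function names and the sample row of pel values are hypothetical and chosen only for illustration; the sketch shows that differences between neighboring pels tend to have a much smaller range than the absolute values, which is why they can be represented more accurately at a given precision.

```python
def delta_encode(row):
    """Keep the first pel value, then store each pel as the difference
    from its left-hand neighbor."""
    if not row:
        return []
    out = [row[0]]
    for prev, cur in zip(row, row[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical row of neighboring pel intensities.
row = [100, 102, 103, 103, 101, 98]
deltas = delta_encode(row)      # [100, 2, 1, 0, -2, -3]
restored = delta_decode(deltas)
```

In this example the differences after the first value span only -3 to 2, whereas the absolute values lie near 100.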

Residual coding is similar to differential encoding in that the error of the representation is further differentially encoded in order to restore the precision of the original data to a desired level of accuracy.

Variations of these methods attempt to transform the video data into alternate representations that expose data correlations in spatial phase and scale. Once the video data has been transformed in these ways, quantization and differential coding methods can then be applied to the transformed data resulting in an increase in the preservation of the salient image features. Two of the most prevalent of these transform video compression techniques are the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). Error in the DCT transform manifests in a wide variation of video data values, and therefore, the DCT is typically used on blocks of video data in order to localize these false correlations. The artifacts from this localization often appear along the border of the blocks. For the DWT, more complex artifacts happen when there is a mismatch between the basis function and certain textures, and this causes a blurring effect. To counteract the negative effects of DCT and DWT, the precision of the representation is increased to lower distortion at the cost of precious bandwidth.

SUMMARY OF THE INVENTION

The present invention builds on the subject matter disclosed in the prior related applications by further adding a statistical analysis to determine an approximation of the normalized pel data. This approximation is the “encoded” form of the normalized pel data. The statistical analysis is achieved through a linear decomposition of the normalized pel data, specifically implemented as a Singular Value Decomposition (SVD) which can be generally referred to as Principal Component Analysis (PCA) in this case. The result of this operation is a set of one or more basis vectors. These basis vectors can be used to progressively describe ever more accurate approximations of the normalized pel data. As such, a truncation of one or more of the least significant basis vectors is performed to produce an encoding that is sufficient to represent the normalized pel data to a required quality level.
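By way of non-limiting illustration, the progressive approximation by a truncated set of basis vectors can be sketched as follows. This sketch is an assumption-laden stand-in and not the disclosed implementation: to remain dependent only on the Python standard library it finds the leading basis vectors by power iteration with deflation rather than by a full SVD, and the names `pca_encode`, `pca_decode` and the tiny synthetic pel data are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_encode(rows, num_basis, iterations=300, seed=0):
    """Approximate each row as mean + sum_k coeffs[k] * basis[k]."""
    rng = random.Random(seed)
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    residual = [c[:] for c in centered]
    basis = []
    for _ in range(num_basis):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _step in range(iterations):
            # Power-iteration step: v <- normalize(X^T (X v)).
            proj = [dot(r, v) for r in residual]
            w = [sum(p * r[j] for p, r in zip(proj, residual)) for j in range(dim)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                v = None  # no variance left to explain
                break
            v = [x / norm for x in w]
        if v is None:
            break
        basis.append(v)
        # Deflation: remove the direction just found from every row.
        residual = [[x - dot(r, v) * vj for x, vj in zip(r, v)] for r in residual]
    coeffs = [[dot(c, b) for b in basis] for c in centered]
    return mean, basis, coeffs


def pca_decode(mean, basis, coeffs):
    """Synthesize rows from the mean, the kept basis vectors and coefficients."""
    return [[m + sum(c * b[j] for c, b in zip(row, basis))
             for j, m in enumerate(mean)] for row in coeffs]


# Hypothetical normalized pel data: four 4-pel "frames" built from two patterns.
base = [10.0, 20.0, 30.0, 40.0]
p1 = [1.0, 0.0, -1.0, 0.0]
p2 = [0.0, 1.0, 0.0, -1.0]
weights = [(3.0, 0.5), (-3.0, -0.5), (1.0, -0.2), (-1.0, 0.2)]
pels = [[b + a * u + c * v for b, u, v in zip(base, p1, p2)] for a, c in weights]


def max_error(num_basis):
    mean, basis, coeffs = pca_encode(pels, num_basis)
    approx = pca_decode(mean, basis, coeffs)
    return max(abs(x - y) for r, s in zip(pels, approx) for x, y in zip(r, s))
```

Calling `max_error(0)`, `max_error(1)` and `max_error(2)` on this data shows the error shrinking as basis vectors are added; because the synthetic data has only two underlying patterns, two basis vectors reproduce it to within floating-point error.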

In general, PCA cannot be effectively applied to the original video frames. But, once the frames have been segmented and further normalized, the variation in the appearance of the pels in those frames no longer has the interference of background pels or the spatial displacements from global motion. Without these two forms of variation in the video frame data, PCA is able to more accurately approximate the appearance of the normalized pel data using fewer basis vectors than it would otherwise. The resulting benefit is a very compact representation, in terms of bandwidth, of the original appearance of the object in the video.

The truncation of basis vectors can be performed in several ways, and each truncation is considered to be a form of precision analysis when combined with PCA itself. This truncation can simply be the described exclusion of entire basis vectors from the set of basis vectors. Alternatively, the vector element and/or element bytes and/or bits of those bytes can be selectively excluded (truncated). Further, the basis vectors themselves can be transformed into alternate forms that would allow even more choices of truncation methods. Wavelet transform using an Embedded Zero Tree truncation is one such form.

Generating normalized pel data and further reducing it to the encoded pel data provides a data representation of the appearance of the pel data in the original video frame. This representation can be useful in and of itself, or as input for other processing. The encoded data may be compact enough to provide an advantageous compression ratio over conventional compression without further processing.

The encoded data may be used in place of the “transform coefficients” in conventional video compression algorithms. In a conventional video compression algorithm, the pel data is “transform encoded” using a Discrete Cosine Transform (DCT). The resulting “transform coefficients” are then further processed using quantization and entropy encoding. Quantization is a way to lower the precision of the individual coefficients. Entropy encoding is a lossless compression of the quantized coefficients and can be thought of in the same sense as zipping a file. The present invention is generally expected to yield a more compact encoded vector than DCT, thereby allowing a higher compression ratio when used in a conventional codec algorithm.
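By way of non-limiting illustration, the two conventional post-processing stages named above can be sketched as follows. The coefficient values, the uniform step size and the 16-bit packing are assumptions made only for this example, and DEFLATE (via the standard `zlib` module) is used as a convenient stand-in for the entropy coder in the "zipping a file" sense; real codecs use purpose-built entropy coders.

```python
import struct
import zlib


def quantize(coeffs, step):
    """Lossy stage: map each coefficient to the nearest multiple of `step`."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    return [q * step for q in levels]


def entropy_encode(levels):
    """Lossless stage: pack levels as signed 16-bit integers, then DEFLATE."""
    raw = struct.pack("<%dh" % len(levels), *levels)
    return zlib.compress(raw, 9)


def entropy_decode(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


# Hypothetical encoded coefficients with a few large and many small values.
coeffs = [12.7, -3.2, 0.4, 0.1, -0.3, 0.0, 0.2, -0.1] * 8
step = 1.0
levels = quantize(coeffs, step)
blob = entropy_encode(levels)
restored = dequantize(entropy_decode(blob), step)
```

Only the quantization stage loses information: the decoded levels match the quantized levels exactly, and each restored coefficient differs from the original by at most half a step.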

In a preferred embodiment, components of interest (i.e., the interesting portions of a video signal) are determined as a function of disproportionate bandwidth consumption and image gradients over time. The components of interest (determined portions of video signal) are normalized relative to global structure, global motion and pose, local deformation and/or illumination. Such normalization reduces the complexity of the components of interest in a manner that enables application of geometric data analysis techniques with increased effectiveness.

In particular, where the video signal is a sequence of frames, the detection of disproportionate bandwidth includes any of:

    (i) analyzing image gradients across one or more frames,
    (ii) integrating finite differences of pels temporally or spatially to form a derivative model, where image gradient is a first derivative and gradient flow is a second derivative,
    (iii) analyzing an illumination field across one or more frames, and
    (iv) predictive analysis,
to determine bandwidth consumption. The determined bandwidth consumption is used to determine the components of interest (or interesting portions of the video signal).
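By way of non-limiting illustration, items (i) and (ii) above can be sketched as a per-block activity score built from temporal and spatial finite differences of pels. Treating this score as a proxy for bandwidth consumption, the block size, the `factor` threshold and all function names are assumptions introduced only for this example and are not taken from the disclosure.

```python
def block_activity(prev, cur, block=4):
    """Sum of squared temporal and spatial finite differences per block."""
    h, w = len(cur), len(cur[0])
    scores = {}
    for by in range(0, h, block):
        for bx in range(0, w, block):
            total = 0
            for y in range(by, min(by + block, h)):
                for x in range(bx, min(bx + block, w)):
                    dt = cur[y][x] - prev[y][x]                         # temporal
                    dx = cur[y][x + 1] - cur[y][x] if x + 1 < w else 0  # horizontal
                    dy = cur[y + 1][x] - cur[y][x] if y + 1 < h else 0  # vertical
                    total += dt * dt + dx * dx + dy * dy
            scores[(by, bx)] = total
    return scores


def disproportionate_blocks(scores, factor=2.0):
    """Blocks whose activity exceeds `factor` times the mean activity."""
    mean = sum(scores.values()) / len(scores)
    return sorted(k for k, v in scores.items() if v > factor * mean)


# Hypothetical 8x8 frames: flat background, textured change in the top-left block.
prev = [[50] * 8 for _ in range(8)]
cur = [[50 + (10 if (y < 4 and x < 4 and (x + y) % 2) else 0) for x in range(8)]
       for y in range(8)]
scores = block_activity(prev, cur)
interesting = disproportionate_blocks(scores)
```

In this example only the block with origin (0, 0) carries any change or texture, so it is the only block flagged as a candidate component of interest.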

The determined components of interest contain structural information including any combination of spatial features and correspondence of spatial features (motion).

In accordance with one aspect of the present invention, the step of normalizing involves forming a structural model and an appearance model of the determined components of interest. The preferred embodiment applies geometric data analysis techniques to at least the appearance model and/or structural model. The reduction in complexity of the components of interest enables application of geometric data analysis with substantially increased effectiveness. The geometric data analysis techniques include any of linear decomposition and nonlinear decomposition. Preferably, linear decomposition employs any of: sequential PCA, power factorization, generalized PCA, and progressive PCA. Progressive PCA may include wavelet transform techniques combined with PCA.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.

FIG. 1 is a block diagram illustrating a prior art video processing system.

FIG. 2 is a block diagram of a system for processing video data according to the principles of the present invention.

FIGS. 2a and 2b are schematic and block diagrams of a computer environment in which embodiments of the present invention operate.

FIG. 3 is a block diagram illustrating the motion estimation method of the invention.

FIG. 4 is a block diagram illustrating the global registration method of the invention.

FIG. 5 is a block diagram illustrating the normalization method of the invention.

FIG. 6 is a block diagram illustrating the hybrid spatial normalization compression method.

FIG. 7 is a block diagram illustrating the mesh generation method of the invention employed in local normalization.

FIG. 8 is a block diagram illustrating the mesh based normalization method of the invention employed in local normalization.

FIG. 9 is a block diagram illustrating the combined global and local normalization method of the invention.

FIG. 10 is a block diagram of a preferred embodiment video compression (image processing, generally) system.

FIG. 11 is a flow diagram illustrating a virtual image sensor of the present invention.

FIG. 12 is a block diagram illustrating the background resolution method.

FIG. 13 is a block diagram illustrating the object segmentation method of the invention.

FIG. 14 is a block diagram illustrating the object interpolation method of the invention.

DETAILED DESCRIPTION OF THE INVENTION

A description of example embodiments of the invention follows.

In video signal data, frames of video are assembled into a sequence of images usually depicting a three dimensional scene as projected, imaged, onto a two dimensional imaging surface. Each frame, or image, is composed of picture elements (pels) that represent an imaging sensor response to the sampled signal. Often the sampled signal corresponds to some reflected, refracted or emitted energy (e.g., electromagnetic, acoustic, etc.) sampled by a two dimensional sensor array. A successive sequential sampling results in a spatiotemporal data stream with two spatial dimensions per frame and a temporal dimension corresponding to the frame's order in the video sequence.

The present invention as illustrated in FIG. 2 analyzes signal data and identifies the salient components. When the signal is comprised of video data, analysis of the spatiotemporal stream reveals salient components that are often specific objects, such as faces. The identification process qualifies the existence and significance of the salient components and chooses one or more of the most significant of those qualified salient components. This does not limit the identification and processing of other less salient components after or concurrently with the presently described processing. The aforementioned salient components are then further analyzed, identifying the variant and invariant subcomponents. The identification of invariant subcomponents is the process of modeling some aspect of the component, thereby revealing a parameterization of the model that allows the component to be synthesized to a desired level of accuracy.

In one embodiment of the invention, a foreground object is detected and tracked. The object's pels are identified and segmented from each frame of the video. Block-based motion estimation is applied to the segmented object in multiple frames. These motion estimates are then integrated into a higher order motion model. The motion model is employed to warp instances of the object to a common spatial configuration. For certain data in this configuration, more of the features of the object are aligned. This normalization allows the linear decomposition of the values of the object's pels over multiple frames to be compactly represented. The salient information pertaining to the appearance of the object is contained in this compact representation.

A preferred embodiment of the present invention details the linear decomposition of a foreground video object. The object is normalized spatially, thereby yielding a compact linear appearance model. A further preferred embodiment additionally segments the foreground object from the background of the video frame prior to spatial normalization.

A preferred embodiment of the invention applies the present invention to a video of a person speaking into a camera while undergoing a small amount of motion.

A preferred embodiment of the invention applies the present invention to any object in a video that can be represented well through spatial transformations.

A preferred embodiment of the invention specifically employs block-based motion estimation to determine finite differences between two or more frames of video. A higher order motion model is factored from the finite differences in order to provide a more effective linear decomposition.
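By way of non-limiting illustration, block-based motion estimation producing a field of finite differences can be sketched as an exhaustive search minimizing the sum of absolute differences (SAD). The block size, search range, tie-breaking rule, displacement convention and synthetic frames are assumptions made only for this example; the disclosure does not fix these details.

```python
def sad(ref, cur, ry, rx, cy, cx, block):
    """Sum of absolute differences between a block in `ref` and one in `cur`."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(block) for i in range(block))


def block_motion(ref, cur, block=4, search=2):
    """For each block of `cur`, return the displacement (dy, dx) to the
    best-matching block position in `ref`."""
    h, w = len(cur), len(cur[0])
    field = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ry, rx = by + dy, bx + dx
                    if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
                        continue  # candidate falls outside the reference frame
                    # Prefer lower cost, then the smaller displacement on ties.
                    key = (sad(ref, cur, ry, rx, by, bx, block), abs(dy) + abs(dx))
                    if best is None or key < best[0]:
                        best = (key, (dy, dx))
            field[(by, bx)] = best[1]
    return field


# Hypothetical textured reference frame and a copy shifted right by one pel.
ref = [[x * x + 100 * y for x in range(12)] for y in range(12)]
cur = [[ref[y][x - 1] if x >= 1 else ref[y][0] for x in range(12)] for y in range(12)]
field = block_motion(ref, cur)
```

Here the content of the block at (4, 4) in the current frame came from one pel to the left in the reference frame, so its estimated displacement is (0, -1).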

Detection & Tracking (C1)

It is known in the art to detect an object in a frame and to track that object through a predetermined number of later frames. Among the algorithms and programs that can be used to perform the object detection function is the Viola/Jones: Viola, P. and M. Jones, “Robust Real-time Object Detection,” Proc. 2nd Int'l. Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing and Sampling, Vancouver, Canada, July 2001. Likewise, there are a number of algorithms and programs that can be used to track the detected object through successive frames. An example includes Edwards, C. et al., “Learning to identify and track faces in an image sequence,” Proc. Int'l. Conf. Auto. Face and Gesture Recognition, pp. 260-265, 1998.

The result of the object detection process is a data set that specifies the general position of the center of the object in the frame and an indication as to the scale (size) of the object. The result of the tracking process is a data set that represents a temporal label for the object and assures that to a certain level of probability the object detected in the successive frames is the same object.

The object detection and tracking algorithm may be applied to a single object in the frames or to two or more objects in the frames.

It is also known to track one or more features of the detected object in the group of sequential frames. If the object is a human face, for example, the features could be an eye or a nose. In one technique, a feature is represented by the intersection of “lines” that can loosely be described as a “corner”. Preferably, “corners” that are both strong and spatially disparate from each other are selected as features. The features may be identified through a spatial intensity field gradient analysis. Employing a hierarchical multi-resolution estimation of the optical flow allows the determination of the translational displacement of the features in successive frames. Black, M. J. and Y. Yacoob, “Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motions,” Proceedings of the International Conference on Computer Vision, pp. 374-381, Boston, June 1995, is an example of an algorithm that uses this technique to track features.

Once the constituent salient components of the signal have been determined, these components may be retained and all other signal components may be diminished or removed. The process of detecting the salient component is shown in FIG. 2 where the video frame 202 is processed by one or more Detect Object 206 processes, resulting in one or more objects being identified and subsequently tracked. The retained components represent the intermediate form of the video data. This intermediate data can then be encoded using techniques that are typically not available to existing video processing methods. As the intermediate data exists in several forms, standard video encoding techniques can also be used to encode several of these intermediate forms. For each instance, the present invention determines and then employs the encoding technique that is most efficient.

In one preferred embodiment, a saliency analysis process detects and classifies salient signal modes. One embodiment of this process employs a combination of spatial filters specifically designed to generate a response signal whose strength is relative to the detected saliency of an object in the video frame. The classifier is applied at differing spatial scales and in different positions of the video frame. The strength of the response from the classifier indicates the likelihood of the presence of a salient signal mode. When centered over a strongly salient object, the process classifies it with a correspondingly strong response. The detection of the salient signal mode distinguishes the present invention by enabling the subsequent processing and analysis on the salient information in the video sequence.

Feature Point Tracking (C7)

Given the detection location of a salient signal mode in one or more frames of video, the present invention analyzes the salient signal mode's invariant features. Additionally, the invention analyzes the residual of the signal, the “less-salient” signal modes, for invariant features. Identification of invariant features provides a basis for reducing redundant information and segmenting (i.e., separating) signal modes.

In one embodiment of the present invention, spatial positions in one or more frames are determined through spatial intensity field gradient analysis. These features correspond to some intersection of “lines” which can be described loosely as a “corner”. Such an embodiment further selects a set of such corners that are both strong corners and spatially disparate from each other, herein referred to as the feature points. Further, employing a hierarchical multi-resolution estimation of the optical flow allows the determination of the translational displacement of the feature points over time.
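By way of non-limiting illustration, selecting feature points that are both strong corners and spatially disparate can be sketched as follows. The corner measure used here, the smaller eigenvalue of a windowed gradient structure tensor, is one common choice and is an assumption of this example rather than the measure fixed by the disclosure; the window radius, minimum distance and synthetic image are likewise hypothetical.

```python
import math


def corner_strength(img, y, x, radius=1):
    """Smaller eigenvalue of the gradient structure tensor in a window.
    It is large only where intensity varies in two directions (a corner)."""
    sxx = syy = sxy = 0.0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            gx = (img[j][i + 1] - img[j][i - 1]) / 2.0  # central differences
            gy = (img[j + 1][i] - img[j - 1][i]) / 2.0
            sxx += gx * gx
            syy += gy * gy
            sxy += gx * gy
    trace, diff = sxx + syy, sxx - syy
    return 0.5 * (trace - math.sqrt(diff * diff + 4.0 * sxy * sxy))


def select_features(img, max_count, min_dist, radius=1):
    """Greedily keep the strongest corners that are at least min_dist apart."""
    h, w = len(img), len(img[0])
    margin = radius + 1
    scored = []
    for y in range(margin, h - margin):
        for x in range(margin, w - margin):
            s = corner_strength(img, y, x, radius)
            if s > 1e-9:
                scored.append((s, y, x))
    scored.sort(reverse=True)
    chosen = []
    for _, y, x in scored:
        if all((y - cy) ** 2 + (x - cx) ** 2 >= min_dist ** 2 for cy, cx in chosen):
            chosen.append((y, x))
            if len(chosen) == max_count:
                break
    return chosen


# Hypothetical 12x12 image: a bright quadrant whose only corner is at (4, 4).
img = [[100 if (y >= 4 and x >= 4) else 0 for x in range(12)] for y in range(12)]
features = select_features(img, max_count=5, min_dist=3)
```

Pels along the straight edges of the bright region score zero because their gradients point in a single direction, so only the corner at (4, 4) is returned.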

In FIG. 2 the Track Object 220 process is shown to pull together the detection instances from the Detect Object processes 208 and further Identify Correspondences 222 of features of one or more of the detected objects over a multitude of Video Frames 202 and 204.

A non-limiting embodiment of feature tracking can be employed such that the features are used to qualify a more regular gradient analysis method such as block-based motion estimation.

Another embodiment anticipates the prediction of motion estimates based on feature tracking.

Object-Based Detection and Tracking (C1)

In one non-limiting embodiment of the current invention, a robust object classifier is employed to track faces in frames of video. Such a classifier is based on a cascaded response to oriented edges that has been trained on faces. In this classifier, the edges are defined as a set of basic Haar features and the rotation of those features by 45 degrees. The cascaded classifier is a variant of the AdaBoost algorithm. Additionally, response calculations can be optimized through the use of summed area tables.
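By way of non-limiting illustration, the use of a summed area table to speed up Haar-like response calculations can be sketched as follows. The function names and the two-rectangle feature layout are hypothetical; the sketch covers only upright rectangles and does not show the 45-degree rotated features or the cascaded classifier itself.

```python
def summed_area_table(img):
    """sat[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 in four lookups."""
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


def haar_two_rect(sat, top, left, height, width):
    """Upright two-rectangle edge feature: left half minus right half."""
    half = width // 2
    left_sum = box_sum(sat, top, left, top + height, left + half)
    right_sum = box_sum(sat, top, left + half, top + height, left + 2 * half)
    return left_sum - right_sum


# Hypothetical 4x4 image patch.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
sat = summed_area_table(img)
center = box_sum(sat, 1, 1, 3, 3)         # 6 + 7 + 10 + 11
response = haar_two_rect(sat, 0, 0, 4, 4)  # left two columns minus right two
```

Once the table is built, every rectangle sum costs four lookups regardless of rectangle size, which is what makes evaluating many features at many scales and positions affordable.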

1. Local Registration

Registration involves the assignment of correspondences between elements of identified objects in two or more video frames. These correspondences become the basis for modeling the spatial relationships between video data at temporally distinct points in the video data.

Various non-limiting means of registration are described for the present invention in order to illustrate specific embodiments and their associated reductions to practice in terms of well known algorithms and inventive derivatives of those algorithms.

One means of modeling the apparent optical flow in a spatio-temporal sequence can be achieved through generation of a finite difference field from two or more frames of the video data. The optical flow field can be sparsely estimated if the correspondences conform to certain constancy constraints in both a spatial and an intensity sense. As shown in FIG. 3, a Frame (302 or 304) is sub-sampled spatially, possibly through a decimation process (306), or some other sub-sampling process (e.g., a low pass filter). These spatially reduced images (310 & 312) can be further sub-sampled as well.
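By way of non-limiting illustration, repeated spatial sub-sampling of a frame can be sketched as follows. Averaging each 2x2 neighborhood before keeping one sample is one simple low pass choice and is an assumption of this example, as are the function names and the synthetic frame.

```python
def downsample(img):
    """Halve each dimension: average every 2x2 neighborhood (a crude low pass
    filter) and keep one sample per neighborhood."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def build_pyramid(img, levels):
    """Return [full resolution, half, quarter, ...] with up to `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        if len(pyramid[-1]) < 2 or len(pyramid[-1][0]) < 2:
            break  # cannot reduce any further
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


# Hypothetical 8x8 frame with a smooth intensity ramp.
frame = [[x + y for x in range(8)] for y in range(8)]
pyramid = build_pyramid(frame, 3)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

A coarse-to-fine motion estimator would typically start at the smallest level, where large displacements span only a few pels, and refine the estimate at each finer level.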

Other motion estimation techniques are suitable, such as various block based, mesh based and phase based motion estimation techniques, as described in related U.S. application Ser. No. 11/396,010.

2. Global Registration

In one embodiment, the present invention generates a correspondence model by using the relationships between corresponding elements of a detected object in two or more frames of video. These relationships are analyzed by factoring one or more linear models from a field of finite difference estimations. The term field refers to each finite difference having a spatial position. These finite differences may be the translational displacements of corresponding object features in disparate frames of video described in the Detection & Tracking section. The field from which such sampling occurs is referred to herein as the general population of finite differences. The described method employs robust estimation similar to that of the RANSAC algorithm as described in: M. A. Fischler, R. C. Bolles. “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography.” Comm. of the ACM, Vol 24, pp 381-395, 1981.

As shown in FIG. 4, the finite differences, in the case of global motion modeling, are Translational Motion Estimates (402) that are collected into a General Population Pool (404) which is iteratively processed by a Random Sampling of those Motion Estimates (410), and a linear model is factored out (420) of those samples. The results are then used to adjust the population (404) to better clarify the linear model through the exclusion of outliers to the model, as found through the random process.

The present invention is able to utilize one or more robust estimators; one of which may be the RANSAC robust estimation process. Robust estimators are well documented in the prior art.

In one embodiment of the linear model estimation algorithm, the motion model estimator is based on a linear least squares solution. This dependency causes the estimator to be thrown off by outlier data. Based on RANSAC, the disclosed method is a robust method of countering the effect of outliers through the iterative estimation of subsets of the data, probing for a motion model that will describe a significant subset of the data. The model generated by each probe is tested for the percentage of the data that it represents. If there are a sufficient number of iterations, then a model will be found that fits the largest subset of the data. Such robust linear least squares regression is described in: R. Dutter and P. J. Huber. “Numerical methods for the nonlinear robust regression problem.” Journal of Statistical Computation and Simulation, 13:79-113, 1981.

As conceived and illustrated in FIG. 4, the present invention discloses innovations beyond the RANSAC algorithm in the form of alterations of the algorithm that involve the initial sampling of finite differences (samples) and least squares estimation of a linear model. Synthesis error is assessed for all samples in the general population using the solved linear model. A rank is assigned to the linear model based on the number of samples whose residual conforms to a preset threshold. This rank is considered the “candidate consensus”.

The initial sampling, solving, and ranking are performed iteratively until termination criteria are satisfied. Once the criteria are satisfied, the linear model with the greatest rank is considered to be the final consensus of the population.
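The sampling, solving, and ranking loop described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: it assumes an affine motion model, a fixed iteration count as the termination criterion, and NumPy for the least squares solve. The names `fit_affine`, `residuals`, and `consensus_model`, and the synthetic data, are hypothetical.

```python
import numpy as np

def fit_affine(pos, disp):
    """Least squares affine motion model: disp ~= [x, y, 1] @ M, with M of shape (3, 2)."""
    A = np.column_stack([pos, np.ones(len(pos))])
    M, *_ = np.linalg.lstsq(A, disp, rcond=None)
    return M

def residuals(M, pos, disp):
    """Synthesis error of model M for every sample in the population."""
    A = np.column_stack([pos, np.ones(len(pos))])
    return np.linalg.norm(A @ M - disp, axis=1)

def consensus_model(pos, disp, n_iter=200, sample_size=3, threshold=0.5, seed=0):
    """Return the model with the greatest rank (count of conforming samples)."""
    rng = np.random.default_rng(seed)
    best_model, best_rank = None, -1
    for _ in range(n_iter):  # assumed termination criterion: fixed iteration count
        idx = rng.choice(len(pos), size=sample_size, replace=False)
        M = fit_affine(pos[idx], disp[idx])
        rank = int(np.sum(residuals(M, pos, disp) < threshold))  # candidate consensus
        if rank > best_rank:
            best_model, best_rank = M, rank
    return best_model, best_rank

# Synthetic population: 45 samples follow one affine model, 15 are outliers.
data_rng = np.random.default_rng(1)
pos = data_rng.uniform(0, 100, size=(60, 2))
true_M = np.array([[0.01, 0.0], [0.0, 0.01], [2.0, -1.0]])
disp = np.column_stack([pos, np.ones(60)]) @ true_M
disp[:15] += data_rng.uniform(5, 10, size=(15, 2))

model, rank = consensus_model(pos, disp)
```

In this sketch the rank is simply the count of samples whose residual falls under the preset threshold, so the winning model is the one that describes the largest subset of the population.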

An optional refinement step involves iteratively analyzing subsets of samples in the order of best fit to the candidate model, and increasing the subset size until adding one more sample would exceed a residual error threshold for the whole subset.
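One possible reading of this refinement step is sketched below. It assumes that the "residual error for the whole subset" is the mean residual of the subset, and it omits any re-estimation of the model between growth steps. The function name `grow_consensus_subset` and the sample values are hypothetical.

```python
def grow_consensus_subset(residuals, max_mean_residual):
    """Add samples in order of best fit until one more would push the
    subset's mean residual over the threshold (assumed error measure)."""
    order = sorted(range(len(residuals)), key=residuals.__getitem__)
    chosen, total = [], 0.0
    for i in order:
        if (total + residuals[i]) / (len(chosen) + 1) > max_mean_residual:
            break  # adding this sample would exceed the subset threshold
        chosen.append(i)
        total += residuals[i]
    return chosen

# Residuals of five samples relative to a candidate model (made-up values).
sample_residuals = [0.1, 5.0, 0.2, 0.3, 9.0]
subset = grow_consensus_subset(sample_residuals, max_mean_residual=0.5)
```

The returned indices identify the samples that would be passed to a final re-estimation of the linear model.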

As shown in FIG. 4, the Global Model Estimation process (450) is iterated until the Consensus Rank Acceptability test is satisfied (452). When the rank has not been achieved, the population of finite differences (404) is sorted relative to the discovered model in an effort to reveal the linear model. The best (highest rank) motion model is added to a solution set in process 460. Then the model is re-estimated in process 470. Upon completion, the population (404) is re-sorted.

The described non-limiting embodiments of the invention can be further generalized as a general method of sampling a vector space, described above as a field of finite difference vectors, in order to determine subspace manifolds in another parameter vector space that would correspond to a particular linear model.

A further result of the global registration process is that the difference between this and the local registration process yields a local registration residual. This residual is the error of the global model in approximating the local model.

3. Normalization (C1)

Normalization refers to the resampling of spatial intensity fields towards a standard, or common, spatial configuration. When these relative spatial configurations are invertible spatial transformations between such configurations, the resampling and accompanying interpolation of pels are also invertible up to a topological limit. The normalization method of the present invention is illustrated in FIG. 5.

When more than two spatial intensity fields are normalized, increased computational efficiency may be achieved by preserving intermediate normalization calculations.

Spatial transformation models used to resample images for the purpose of registration, or equivalently for normalization, include global and local models. Global models are of increasing order from translational to projective. Local models are finite differences that imply an interpolant on a neighborhood of pels as determined basically by a block or more complexly by a piece-wise linear mesh.

Interpolation of original intensity fields to normalized intensity fields increases the linearity of PCA appearance models based on subsets of the intensity field.

As shown in FIG. 2, the object pels 232 and 234 can be re-sampled 240 to yield a normalized version of the object pels 242 and 244.

4. Mesh-Based Normalization

A further embodiment of the present invention tessellates the feature points into a triangle-based mesh. The vertices of the mesh are tracked, and the relative positions of each triangle's vertices are used to estimate the three-dimensional surface normal for the plane coincident with those three vertices. When the surface normal is coincident with the projective axis of the camera, the imaged pels can provide a least-distorted rendering of the object corresponding to the triangle. Creating a normalized image that tends to favor the orthogonal surface normal can produce a pel-preserving intermediate data type that will increase the linearity of subsequent appearance-based PCA models.

Another embodiment utilizes conventional block-based motion estimation to implicitly model a global motion model. In one non-limiting embodiment, the method factors a global affine motion model from the motion vectors described by the conventional block-based motion estimation/prediction.

The present inventive method utilizes one or more global motion estimation techniques including the linear solution to a set of affine projective equations. Other projective models and solution methods are described in the prior art.

FIG. 9 illustrates the method of combining global and local normalization.

5. Local Normalization

The present invention provides a means by which pels in the spatiotemporal stream can be registered in a ‘local’ manner.

One such localized method employs the spatial application of a geometric mesh to provide a means of analyzing the pels such that localized coherency in the imaged phenomena is accounted for when resolving the apparent image brightness constancy ambiguities in relation to the local deformation of the imaged phenomena, or specifically an imaged object.

Such a mesh is employed to provide a piece-wise linear model of surface deformation in the image plane as a means of local normalization. The imaged phenomena may often correspond to such a model when the temporal resolution of the video stream is high compared with the motion in the video. Exceptions to the model assumptions are handled through a variety of techniques, including: topological constraints, neighbor vertex restrictions, and analysis of homogeneity of pel and image gradient regions.

In one embodiment, feature points are used to generate a mesh constituted of triangular elements whose vertices correspond to the feature points. The corresponding feature points in other frames imply an interpolated “warping” of the triangles, and correspondingly the pels, to generate a local deformation model.

FIG. 7 illustrates the generation of such an object mesh. FIG. 8 illustrates the use of such an object mesh to locally normalize frames.

In one embodiment, a triangle map is generated which identifies the triangle that each pel of the map comes from. Further, the affine transform corresponding to each triangle is pre-computed as an optimization step. When creating the local deformation model, the anchor image (previous) is traversed using the spatial coordinates to determine the coordinates of the source pel to sample. This sampled pel will replace the current pel location.
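The triangle map and pre-computed per-triangle affine transforms can be illustrated with the following sketch. It assumes nearest-neighbor sampling, a triangle map that has already been rasterized (with -1 marking pels outside the mesh), and transforms that map current coordinates to anchor coordinates. The function names and the toy data are hypothetical.

```python
def affine_from_triangle(src, dst):
    """Affine (a, b, c, d, e, f) with x' = a*x + b*y + c and y' = d*x + e*y + f
    that maps the three src vertices onto the three dst vertices."""
    (x0, y0), (x1, y1), (x2, y2) = src
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")

    def solve(v0, v1, v2):
        a = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det
        b = ((x1 - x0) * (v2 - v0) - (x2 - x0) * (v1 - v0)) / det
        return a, b, v0 - a * x0 - b * y0

    (u0, w0), (u1, w1), (u2, w2) = dst
    return solve(u0, u1, u2) + solve(w0, w1, w2)

def warp_with_triangle_map(anchor, tri_map, affines):
    """For each pel, look up its triangle, map its coordinates into the
    anchor image, and copy the sampled anchor pel (nearest neighbor)."""
    h, w = len(tri_map), len(tri_map[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            t = tri_map[y][x]
            if t < 0:
                continue  # pel does not belong to any triangle
            a, b, c, d, e, f = affines[t]
            sx, sy = int(round(a * x + b * y + c)), int(round(d * x + e * y + f))
            if 0 <= sy < len(anchor) and 0 <= sx < len(anchor[0]):
                out[y][x] = anchor[sy][sx]
    return out

# Toy case: one triangle whose vertices moved one pel to the right.
anchor = [[10 * y + x for x in range(4)] for y in range(4)]
tri_map = [[0] * 4 for _ in range(4)]
affines = [affine_from_triangle([(0, 0), (3, 0), (0, 3)], [(1, 0), (4, 0), (1, 3)])]
warped = warp_with_triangle_map(anchor, tri_map, affines)
```

Pre-computing `affines` once per triangle keeps the per-pel work to a table lookup and six multiply-adds, which is the optimization the paragraph above refers to.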

In another embodiment, local deformation is performed after global deformation. In a previously disclosed specification above, Global Normalization was described as the process by which a Global Registration method is used to spatially normalize pels in two or more frames of video. The resulting globally normalized video frames can further be locally normalized. The combination of these two methods constrains the local normalization to a refinement of the globally arrived at solution. This can greatly reduce the ambiguity that the local method is required to resolve.

In another non-limiting embodiment, feature points, or in the case of a “regular mesh”, vertex points, are qualified through analysis of the image gradient in the neighborhood of those points. This image gradient can be calculated directly, or through some indirect calculation such as a Harris response. Additionally, these points can be filtered by a spatial constraint and motion estimation error associated with a descent of the image gradient. The qualified points can be used as the basis for a mesh by one of many tessellation techniques, resulting in a mesh whose elements are triangles. For each triangle, an affine model is generated based on the points and their residual motion vector.

The present inventive method utilizes one or more image intensity gradient analysis methods, including the Harris response. Other image intensity gradient analysis methods are described in the prior art.

In an embodiment, a list of the triangles' affine parameters is maintained. The list is iterated and a current/previous point list is constructed (using a vertex look up map). The current/previous point list is passed to a routine that is used to estimate the transform, which computes the affine parameters for that triangle. The affine parameters, or model, are then saved in the triangle affine parameter list.

In a further embodiment, the method traverses a triangle identifier image map, where each pel in the map contains the identifier for the triangle in the mesh for which the pel has membership. For each pel that belongs to a triangle, the corresponding global deformation coordinates and local deformation coordinates for that pel are calculated. Those coordinates, in turn, are used to sample the corresponding pel and to apply its value in the corresponding “normalized” position.

In a further embodiment, spatial constraints are applied to the points based on density and the image intensity correspondence strength resulting from the search of the image gradient. The points are sorted after motion estimation is done based on some norm of the image intensity residual. Then the points are filtered based on a spatial density constraint.
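The sort-then-filter step can be sketched as a greedy selection. This sketch assumes that each point carries a scalar residual norm and that the density constraint is a minimum Euclidean spacing between kept points; both are illustrative choices, and the name `filter_by_density` is hypothetical.

```python
def filter_by_density(points, min_dist):
    """points: list of (x, y, residual). Visit points from lowest to highest
    residual and keep a point only if it is at least min_dist away from
    every point already kept."""
    kept = []
    for x, y, r in sorted(points, key=lambda p: p[2]):
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_dist ** 2 for kx, ky, _ in kept):
            kept.append((x, y, r))
    return kept

# Two tight pairs of candidate points; the better point of each pair survives.
candidates = [(0, 0, 0.5), (1, 0, 0.1), (10, 10, 0.3), (10, 11, 0.9)]
qualified = filter_by_density(candidates, min_dist=3)
```

Sorting first means that, within any crowded neighborhood, the point with the strongest intensity correspondence is the one retained.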

In a further embodiment, spectral spatial segmentation is employed, and small homogeneous spectral regions are merged with neighboring regions based on spatial affinity and similarity of their intensity and/or color. Then homogeneous merging is used to combine spectral regions together based on their overlap with a region of homogeneous texture (image gradient). A further embodiment then uses center-surround points, those where a small region is surrounded by a larger region, as qualified interest points for the purpose of supporting a vertex point of the mesh. In a further non-limiting embodiment, a center-surround point is defined as a region whose bounding box is within one pel of being 3×3 or 5×5 or 7×7 pels in dimension, and the spatial image gradient for that bounding box is a corner shape. The center of the region can be classified as a corner, further qualifying that position as an advantageous vertex position.

In a further embodiment, the horizontal and vertical pel finite difference images are used to classify the strength of each mesh edge. If an edge has many finite differences coincident with its spatial position, then the edge, and hence the vertices of that edge, are considered to be highly critical to the local deformation of the imaged phenomena. If there is a large derivative difference between the averages of the sums of the finite differences of the edge, then most likely the region edge corresponds to a texture change edge, and not a quantization step.

In a further embodiment, a spatial density model termination condition is employed to optimize the processing of the mesh vertices. When a sufficient number of points have been examined that covers most of the spatial area of an outset of the detection rectangle, then the processing can be terminated. The termination generates a score. Vertex and feature points entering the processing are sorted by this score. If the point is too spatially close to an existing point, or if the point does not correspond to an edge in the image gradient, then it is discarded. Otherwise, the image gradient in the neighborhood of the point is descended, and if the residual of the gradient exceeds a limit, then that point is also discarded.

In a preferred embodiment, the local deformation modeling is performed iteratively, converging on a solution as the vertex displacements per iteration diminish. In another embodiment, local deformation modeling is performed, and the model parameters are discarded if the global deformation has already provided the same normalization benefit.

Other normalization techniques alone or in combination (such as described in related U.S. application Ser. No. 11/396,010) are suitable.

Segmentation

The spatial discontinuities identified through the further described segmentation processes are encoded efficiently through geometric parameterization of their respective boundaries, referred to as spatial discontinuity models. These spatial discontinuity models may be encoded in a progressive manner allowing for ever more concise boundary descriptions corresponding to subsets of the encoding. Progressive encoding provides a robust means of prioritizing the spatial geometry while retaining much of the salient aspects of the spatial discontinuities.

A preferred embodiment of the present invention combines a multi-resolution segmentation analysis with the gradient analysis of the spatial intensity field and further employs a temporal stability constraint in order to achieve a robust segmentation.

As shown in FIG. 2, once the correspondences of features of an object have been tracked over time 220 and modeled 224, adherence to this motion/deformation model can be used to segment the pels corresponding to the object 230. This process can be repeated for a multiple of detected objects 206 and 208 in the video 202 and 204. The results of this processing are the segmented object pels 232.

One form of invariant feature analysis employed by the present invention is focused on the identification of spatial discontinuities. These discontinuities manifest as edges, shadows, occlusions, lines, corners or any other visible characteristic that causes an abrupt and identifiable separation between pels in one or more imaged frames of video. Additionally, subtle spatial discontinuities between similarly colored and/or textured objects may only manifest when the pels of the objects in the video frame are undergoing coherent motion relative to the objects themselves, but different motion relative to each other. The present invention utilizes a combination of spectral, texture and motion segmentation to robustly identify the spatial discontinuities associated with a salient signal mode.

Temporal Segmentation

The temporal integration of translational motion vectors, or equivalently finite difference measurements in the spatial intensity field, into a higher order motion model is a form of motion segmentation that is described in the prior art.

In one embodiment of the invention, a dense field of motion vectors is produced representing the finite differences of object motion in the video. These derivatives are grouped together spatially through a regular partitioning of tiles or by some initialization procedure such as spatial segmentation. The “derivatives” of each group are integrated into a higher order motion model using a linear least squares estimator. The resulting motion models are then clustered as vectors in the motion model space using the k-means clustering technique. The derivatives are classified based on which cluster best fits them. The cluster labels are then spatially clustered as an evolution of the spatial partitioning. The process is continued until the spatial partitioning is stable.

In a further embodiment of the invention, motion vectors for a given aperture are interpolated to a set of pel positions corresponding to the aperture. When the block defined by this interpolation spans pels corresponding to an object boundary, the resulting classification is some anomalous diagonal partitioning of the block.

In the prior art, the least squares estimator used to integrate the derivatives is highly sensitive to outliers. The sensitivity can generate motion models that heavily bias the motion model clustering method to the point that the iterations diverge widely.

In the present invention, the motion segmentation methods identify spatial discontinuities through analysis of apparent pel motion over two or more frames of video. The apparent motion is analyzed for consistency over the frames of video and integrated into parametric motion models. Spatial discontinuities associated with such consistent motion are identified. Motion segmentation can also be referred to as temporal segmentation, because temporal changes may be caused by motion. However, temporal changes may also be caused by some other phenomena such as local deformation, illumination changes, etc.

Through the described method, the salient signal mode that corresponds to the normalization method can be identified and separated from the ambient signal mode (background or non-object) through one of several background subtraction methods. Often, these methods statistically model the background as the pels that exhibit the least amount of change at each time instance. Change can be characterized as a pel value difference.

Segmentation perimeter-based global deformation modeling is achieved by creating a perimeter around the object, then collapsing the perimeter toward the detected center of the object until perimeter vertices have achieved a position coincident with a heterogeneous image gradient. Motion estimates are gathered for these new vertex positions and robust affine estimation is used to find the global deformation model.

Segmented mesh vertex image gradients (in particular, descent-based finite differences) are integrated into a global deformation model.

Object Segmentation

The block diagram shown in FIG. 13 shows one embodiment of object segmentation. The process begins with an ensemble of normalized images 1302 that are then pair-wise differenced 1304 among the ensemble. These differences are then element-wise accumulated 1306 into an accumulation buffer. The accumulation buffer is thresholded 1310 in order to identify the more significant error regions. The thresholded element mask is then morphologically analyzed 1312 in order to determine the spatial support of the accumulated error regions 1310. The resulting extraction 1314 of the morphological analysis 1312 is then compared with the detected object position 1320 in order to focus subsequent processing on accumulated error regions that are coincident with the object. The isolated spatial region's 1320 boundary is then approximated with a polygon 1322 of which a convex hull is generated 1324. The contour of the hull is then adjusted 1332 in order to better initialize the vertices' positions for active contour analysis 1332. Once the active contour analysis 1332 has converged on a low energy solution in the accumulated error space, the contour is used as the final contour 1334. The pels constrained in the contour are considered those that are most likely object pels, and those pels outside of the contour are considered to be non-object pels.

In one embodiment, motion segmentation can be achieved given the detected position and scale of the salient image mode. A distance transform can be used to determine the distance of every pel from the detected position. If the pel values associated with the maximum distance are retained, a reasonable model of the background can be resolved. In other words, the ambient signal is re-sampled temporally using a signal difference metric.

A further embodiment includes employing a distance transform relative to the current detection position to assign a distance to each pel. If the distance to a pel is greater than the distance in some maximum pel distance table, then the pel value is recorded. After a suitable training period, the pel is assumed to have the highest probability of being a background pel if the maximum distance for that pel is large.

Given a model of the ambient signal, the complete salient signal mode at each time instance can be differenced. Each of these differences can be re-sampled into spatially normalized signal differences (absolute differences). These differences are then aligned relative to each other and accumulated. Since these differences have been spatially normalized relative to the salient signal mode, peaks of difference will mostly correspond to pel positions that are associated with the salient signal mode.

In one embodiment of the invention, a training period is defined where object detection positions are determined and a centroid of these positions is used to determine optimal frame numbers with detection positions far from this position that would allow for frame differencing to yield background pels that would have the highest probability of being non-object pels.

In one embodiment of the present invention, active contour modeling is used to segment the foreground object from the non-object background by determining contour vertex positions in the accumulated error “image”. In a preferred embodiment, the active contour edges are subdivided commensurate with the scale of the detected object to yield a greater degree of freedom. In a preferred embodiment, the final contour positions can be snapped to a nearest regular mesh vertex in order to yield a regularly spaced contour.

In one non-limiting embodiment of object segmentation, an oriented kernel is employed for generating error image filter responses for temporally pair-wise images. Response to the filter that is oriented orthogonal to the gross motion direction tends to enhance the error surface when motion relative to the background occurs from occlusion and revealing of the background.

The normalized image frame intensity vectors of an ensemble of normalized images are differenced from one or more reference frames, creating residual vectors. These residual vectors are accumulated element-wise to form an accumulated residual vector. This accumulated residual vector is then probed spatially in order to define a spatial object boundary for spatial segmentation of the object and non-object pels.

In one embodiment, an initial statistical analysis of the accumulated residual vector is performed to arrive at a statistical threshold value that can be used to threshold the accumulated residual vector. Through an erosion and subsequent dilation morphological operation, a preliminary object region mask is created. The contour polygon points of the region are then analyzed to reveal the convex hull of these points. The convex hull is then used as an initial contour for an active contour analysis method. The active contour is then propagated until it converges on the spatial boundaries of the object's accumulated residual. In a further preferred embodiment, the preliminary contour's edges are further subdivided by adding midpoint vertices until a minimal edge length is achieved for all the edge lengths. This further embodiment is meant to increase the degrees of freedom of the active contour model to more accurately fit the outline of the object.
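The thresholding and erosion/dilation stages can be sketched as follows. The sketch assumes a threshold of mean plus `k` standard deviations and a 3×3 structuring element clipped at the image border; the source does not specify either, and the function names are hypothetical. The convex hull and active contour stages are not shown.

```python
from statistics import mean, pstdev

def threshold_mask(acc, k=1.0):
    """Binary mask of accumulated residuals above mean + k * std (assumed rule)."""
    flat = [v for row in acc for v in row]
    t = mean(flat) + k * pstdev(flat)
    return [[1 if v > t else 0 for v in row] for row in acc]

def _morph(mask, rule):
    """Apply `rule` (all -> erosion, any -> dilation) over each pel's
    3x3 neighborhood, clipped to the image bounds."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nb = [mask[j][i]
                  for j in range(max(0, y - 1), min(h, y + 2))
                  for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = 1 if rule(nb) else 0
    return out

def erode(mask):
    return _morph(mask, all)

def dilate(mask):
    return _morph(mask, any)

# Accumulated residual with a 3x3 object region and one isolated speck.
acc = [[0.0] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(2, 5):
        acc[y][x] = 10.0
acc[0][6] = 10.0

mask = threshold_mask(acc)
region_mask = dilate(erode(mask))  # erosion then dilation removes the speck
```

Erosion followed by dilation discards regions smaller than the structuring element while restoring the extent of the regions that survive, which is why the isolated speck disappears and the 3×3 region remains.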

In at least one embodiment, the refined contour is used to generate a pel mask indicating the pels of the object by overlaying the polygon implied by the contour on the normalized images.

6. Resolution of Non-Object

The block diagram shown in FIG. 12 discloses one embodiment of non-object segmentation, or equivalently background resolution. With the initialization of a background buffer (1206) and an initial maximum distance value (1204) buffer, the process works to determine the most stable non-object pels by associating “stability” with the greatest distance from the detected object position (1202). Given a new detected object position (1202), the process checks each pel position (1210). For each pel position (1210), the distance from the detected object position (1210) is calculated using a distance transform. If the distance for that pel is greater (1216) than the previously stored position in the maximum distance buffer (1204), then the previous value is replaced with the current value (1218) and the pel value is recorded (1220) in the pel buffer.
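A minimal sketch of this background resolution loop is given below. It assumes that the distance transform reduces to the Euclidean distance from a single detected position and that the maximum distance buffer is initialized to -1; `update_background` and the toy frames are hypothetical.

```python
import math

def update_background(frame, detect_pos, background, max_dist):
    """Record a pel value whenever the pel is farther from the detected
    object position than it has been in any earlier frame."""
    cx, cy = detect_pos
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            d = math.hypot(x - cx, y - cy)
            if d > max_dist[y][x]:
                max_dist[y][x] = d
                background[y][x] = value

# One-row frames: the object (value 9) moves from the left edge to the right edge.
frames = [([[9, 1, 1, 1, 1]], (0, 0)),
          ([[1, 1, 1, 1, 9]], (4, 0))]
background = [[0] * 5]
max_dist = [[-1.0] * 5]
for frame, position in frames:
    update_background(frame, position, background, max_dist)
```

After both frames, each pel holds the value it had when the object was farthest away, so the object pel recorded at the left edge in the first frame is overwritten by the background value seen in the second.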

Given a resolved background image, the error between this image and the current frame can be normalized spatially and accumulated temporally. Such a resolved background image is described in the “background resolution” section. The resolution of the background through this method is considered a time-based occlusion filter process.

The resulting accumulated error is then thresholded to provide an initial contour. The contour is then propagated spatially to balance error residual against contour deformation.

In an alternative embodiment, absolute differences between the current frame and the resolved background frames are computed. The element-wise absolute difference is then segmented into distinct spatial regions. The average pel value of each region's bounding box is computed, so that when the resolved background is updated, the difference between the current and resolved background average pel value can be used to perform a contrast shift, so that the current region can blend in more effectively with the resolved background. In another embodiment, the vertices within the normalized frame mask are motion estimated and saved for each frame. These are then processed using SVD to generate a local deformation prediction for each of the frames.

Other segmentation methods and mechanisms, e.g., texture, spectral and background, are employed in preferred embodiments as described in related U.S. application Ser. No. 11/396,010.

Appearance Variance Modeling

A common goal of video processing is to model and preserve the appearance of a sequence of video frames. The present invention is aimed at allowing constrained appearance modeling techniques to be applied in robust and widely applicable ways through the use of preprocessing. The registration, segmentation and normalization described previously are expressly for that purpose.

The present invention discloses a means of appearance variance modeling. The primary basis of the appearance variance modeling is, in the case of a linear model, the analysis of feature vectors to reveal a compact basis exploiting linear correlations. Feature vectors representing spatial intensity field pels can be assembled into an appearance variance model.

In an alternative embodiment, the appearance variance model is calculated from a segmented subset of the pels. Further, the feature vector can be separated into spatially non-overlapping feature vectors. Such spatial decomposition may be achieved with a spatial tiling. Computational efficiency may be achieved through processing these temporal ensembles without sacrificing the dimensionality reduction of the more global PCA method.

When generating an appearance variance model, spatial intensity field normalization can be employed to decrease PCA modeling of spatial transformations.

Deformation Modeling

Local deformation can be modeled as vertex displacements and an interpolation function can be used to determine the resampling of pels according to vertices that are associated with those pels. These vertex displacements may provide a large amount of variation in motion when looked at as a single parameter set across many vertices. Correlations in these parameters can greatly reduce the dimensionality of this parameter space.

PCA

The preferred means of generating an appearance variance model is through the assembly of frames of video as pattern vectors into a training matrix, or ensemble, and application of Principal Component Analysis (PCA) on the training matrix. When such an expansion is truncated, the resulting PCA transformation matrix is employed to analyze and synthesize subsequent frames of video. Based on the level of truncation, varying levels of quality of the original appearance of the pels can be achieved.

The specific means of construction and decomposition of the pattern vectors is well known to one skilled in the art.

Given the spatial segmentation of the salient signal mode from the ambient signal and the spatial normalization of this mode, the pels themselves, or equivalently, the appearance of the resulting normalized signal, can be factored into linearly correlated components with a low rank parameterization allowing for a direct trade off between approximation error and bit rate for the representation of the pel appearance. One method for achieving a low rank approximation is through the truncation of bytes and/or bits of encoded data. A low rank approximation is considered a compression of the original data as determined by the specific application of this technique. For example, in video compression, if the truncation of data does not unduly degrade the perceptual quality, then the application specific goal is achieved along with compression.

As shown in FIG. 2, the normalized object pels 242 and 244 can be projected into a vector space and the linear correspondences can be modeled using a decomposition process 250 such as PCA in order to yield a dimensionally concise version of the data 252 and 254.

PCA and Precision Analysis

The present invention employs a statistical analysis to determine an approximation of the normalized pel data. This approximation is the “encoded” form of the normalized pel data. The statistical analysis is achieved through a linear decomposition of the normalized pel data, specifically implemented as a Singular Value Decomposition (SVD) which can be generally referred to as Principal Component Analysis (PCA) in this case. The result of this operation is a set of one or more basis vectors. These basis vectors can be used to progressively describe ever more accurate approximations of the normalized pel data. As such, the truncation of one or more of the least significant basis vectors is performed to produce an encoding that is sufficient to represent the normalized pel data to a required quality level.

In general, PCA cannot be effectively applied to the original video frames. But, once the frames have been segmented and further normalized, the variation in the appearance of the pels in those frames no longer has the interference of background pels or the spatial displacements from global motion. Without these two forms of variation, PCA is able to more accurately approximate the appearance of this normalized pel data using fewer basis vectors than it would otherwise. The resulting benefit is a very compact representation, in terms of bandwidth, of the original appearance of the object in the video.

The truncation of basis vectors can be performed in several ways, and each truncation is considered to be a form of precision analysis when combined with PCA. This truncation can simply be the described exclusion of entire basis vectors from the set of basis vectors. Alternatively, the vector element and/or element bytes and/or bits of those bytes can be selectively excluded (truncated). Further, the basis vectors themselves can be transformed into alternate forms that would allow even more choices of truncation methods. Wavelet transform using an Embedded Zero Tree truncation is one such form.

Method

Normalized pel data from 242 and 244 in FIG. 2 are reorganized into pattern vectors that are assembled into an ensemble of vectors that is decomposed into a set of basis vectors using PCA, or more specifically, SVD.

Least significant basis vectors are then removed (truncated) from the set of basis vectors to achieve a quality requirement.

Individual normalized pel data associated with each frame produces an encoded pattern vector when projected onto the truncated basis vectors. This encoded pattern vector is the encoded form of the normalized pel data, referred to as the encoded pel data. Note that the normalized pel data also needs to be reorganized into a “pattern vector” prior to being projected on the basis.

The encoded pel data can be decoded by projecting it onto the inversion of the basis vectors. This inverse projection yields an approximation (synthesis) of the original normalized pel data 242, 244.
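The four steps of this method can be sketched with NumPy as follows. The sketch makes two assumptions not stated in the source: the ensemble mean is subtracted before decomposition, and, because the basis vectors returned by the SVD are orthonormal, the "inversion" of the basis is taken to be its transpose. The names `train_basis`, `encode`, and `decode`, and the synthetic ensemble, are hypothetical.

```python
import numpy as np

def train_basis(ensemble, n_keep):
    """ensemble: (n_pels, n_frames), one pattern vector per column.
    Returns the ensemble mean and the n_keep most significant basis vectors."""
    mean = ensemble.mean(axis=1)
    U, s, _ = np.linalg.svd(ensemble - mean[:, None], full_matrices=False)
    return mean, U[:, :n_keep]  # truncation of the least significant vectors

def encode(pattern, mean, basis):
    """Project a pattern vector onto the truncated basis."""
    return basis.T @ (pattern - mean)

def decode(coeffs, mean, basis):
    """Synthesize an approximation of the normalized pel data."""
    return mean + basis @ coeffs

# Synthetic ensemble: 10 frames of 50 pels that vary along two appearance modes.
rng = np.random.default_rng(0)
modes = rng.normal(size=(50, 2))
weights = rng.normal(size=(2, 10))
ensemble = 100.0 + modes @ weights

mean, basis = train_basis(ensemble, n_keep=2)
pattern = ensemble[:, 3]
coeffs = encode(pattern, mean, basis)
synthesis = decode(coeffs, mean, basis)
```

Here a 50-element pattern vector is represented by two coefficients. Keeping fewer basis vectors than the data has modes would lower the coefficient count further at the cost of approximation error.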

Uses

Generating normalized pel data and further reducing it to the encoded pel data provides a data representation of the appearance of the pel data in the original video frame. This representation can be useful in and of itself, or as input for other processing. The encoded data may be compact enough to provide an advantageous compression ratio over conventional compression without further processing.

The encoded data may be used in place of the “transform coefficients” in conventional video compression algorithms. In a conventional video compression algorithm, the pel data is “transform encoded” using a Discrete Cosine Transform (DCT). The resulting “transform coefficients” are then further processed using quantization and entropy encoding. Quantization is a way to lower the precision of the individual coefficients. Entropy encoding is a lossless compression of the quantized coefficients and can be thought of in the same sense as zipping a file. The present invention is generally expected to yield a more compact encoded vector than DCT, thereby allowing a higher compression ratio when used in a conventional codec algorithm.
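The quantization and entropy encoding stages can be illustrated with the Python standard library. This sketch assumes uniform scalar quantization with a single step size and uses `zlib` (DEFLATE) as a stand-in for a codec's entropy coder, in keeping with the zipping analogy above; it is not the entropy coder of any particular codec. The function names and coefficient values are hypothetical.

```python
import struct
import zlib

def quantize(coeffs, step):
    """Lower the precision of each coefficient to a multiple of `step`."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    return [q * step for q in levels]

def entropy_encode(levels):
    """Losslessly compress quantized levels (packed as 16-bit integers)."""
    return zlib.compress(struct.pack(f"<{len(levels)}h", *levels))

def entropy_decode(blob, count):
    return list(struct.unpack(f"<{count}h", zlib.decompress(blob)))

coeffs = [12.3, -4.7, 0.2, 0.1, -0.05, 0.0]  # made-up encoded pattern vector
step = 0.5
levels = quantize(coeffs, step)
blob = entropy_encode(levels)
restored = dequantize(entropy_decode(blob, len(levels)), step)
```

Only the quantization step loses information: the entropy stage round-trips the levels exactly, and each restored coefficient differs from the original by at most half a step.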

In one embodiment, the invention system alternates between encoding video frames as described in U.S. patent application Ser. No. 11/191,562 and the above-described approximation encoding. The system alternates as a function of least used bandwidth.

Sequential PCA

PCA encodes patterns into PCA coefficients using a PCA transform. The better the patterns are represented by the PCA transform, the fewer coefficients are needed to encode the pattern. Recognizing that pattern vectors may degrade as time passes between acquisition of the training patterns and the patterns to be encoded, updating the transform can help to counteract the degradation. As an alternative to generating a new transform, sequential updating of existing patterns is more computationally efficient in certain cases.

Many state of the art video compression algorithms predict a frame of video from one or more other frames. The prediction model is commonly based on a partitioning of each predicted frame into non-overlapping tiles which are matched to a corresponding patch in another frame and an associated translational displacement parameterized by an offset motion vector. This spatial displacement, optionally coupled with a frame index, provides the “motion predicted” version of the tile. If the error of the prediction is below a certain threshold, the tile's pels are suitable for residual encoding; and there is a corresponding gain in compression efficiency. Otherwise, the tile's pels are encoded directly. This type of tile-based, alternatively termed block-based, motion prediction method models the video by translating tiles containing pels. When the imaged phenomena in the video adhere to this type of modeling, the corresponding encoding efficiency increases. This modeling constraint assumes a certain level of temporal resolution, or number of frames per second, is present for imaged objects undergoing motion in order to conform to the translational assumption inherent in block-based prediction. Another requirement for this translational model is that the spatial displacement for a certain temporal resolution must be limited; that is, the time difference between the frames from which the prediction is derived and the frame being predicted must be a relatively short amount of absolute time. These temporal resolution and motion limitations facilitate the identification and modeling of certain redundant video signal components that are present in the video stream.
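The tile matching described above can be sketched as an exhaustive block search. This is an illustrative sketch only: the sum of absolute differences (SAD) error measure, the search range and the per-pel threshold are assumptions rather than values taken from any particular codec.

```python
import numpy as np

def best_motion_vector(reference, tile, top, left, search=4):
    """Search the reference frame around (top, left) for the patch that best
    matches the tile; return the offset motion vector and its SAD error."""
    size = tile.shape[0]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidate patches that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > reference.shape[0] or x + size > reference.shape[1]:
                continue
            sad = np.abs(reference[y:y + size, x:x + size] - tile).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

def choose_mode(sad, tile, threshold_per_pel=2.0):
    """Residual-encode the tile if the prediction error is small, else encode directly."""
    return "residual" if sad / tile.size < threshold_per_pel else "direct"

# Hypothetical frames: the current frame is the reference shifted down 2 and right 3.
rng = np.random.default_rng(1)
reference = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
current = np.roll(reference, shift=(2, 3), axis=(0, 1))

tile = current[8:16, 8:16]
vector, sad = best_motion_vector(reference, tile, top=8, left=8)
mode = choose_mode(sad, tile)
```

The search only considers translations, which is exactly the modeling constraint the paragraph describes: rotation, scaling or large displacements would raise the error and push tiles toward direct encoding.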

In the present invention method, sequential PCA is combined with embedded zero-tree wavelet to further enhance the utility of the hybrid compression method. The sequential PCA technique provides a means by which conventional PCA can be enhanced for signals that have a temporal coherency or temporally local smoothness. The embedded zero-tree wavelet provides a means by which a locally smooth spatial signal can be decomposed into a space-scale representation in order to increase the robustness of certain processing and also the computational efficiency of the algorithm. For the present invention, these two techniques are combined to increase the representation power of the variance models and also provide a representation of those models that is compact and ordered such that much of the representational power of the basis is provided by a truncation of the basis.

In another embodiment, sequential PCA is applied with a fixed input block size and fixed tolerance to increase the weighted bias to the first and most energetic PCA components. For longer data sequences this first PCA component is often the only PCA component. This affects the visual quality of the reconstruction and can limit the utility of the described approach in some ways. The present invention employs a different norm for the selection of PCA components that is preferable to the use of the conventionally used least-square norm. This form of model selection avoids the over-approximation by the first PCA component.

In another embodiment, a block PCA process with a fixed input block size and prescribed number of PCA components per data block is employed to provide beneficial uniform reconstruction traded against using relatively more components. In a further embodiment, the block PCA is used in combination with sequential PCA, where block PCA reinitializes the sequential PCA after a set number of steps with a block PCA step. This provides a beneficial uniform approximation with a reduction in the number of PCA components.

In another embodiment, the invention capitalizes on the situation where the PCA components before and after encoding-decoding are visually similar. The quality of the image sequence reconstructions before and after encoding-decoding may also be visually similar and this often depends on the degree of quantization employed. The present inventive method decodes the PCA components and then renormalizes them to have a unit norm. For moderate quantization the decoded PCA components are approximately orthogonal. At a higher level of quantization the decoded PCA components are partially restored by application of singular value decomposition (SVD) to obtain an orthogonal basis and modified set of reconstruction coefficients.
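One way to realize the renormalize-then-orthogonalize step is sketched below. The source does not spell out how the modified reconstruction coefficients are obtained; this sketch assumes they are chosen so that the reconstruction from the restored orthogonal basis matches the reconstruction from the decoded (quantized) components. The function names and the quantization step are hypothetical.

```python
import numpy as np

def quantize(values, step):
    """Uniform quantization, standing in for the encode-decode round trip."""
    return np.round(values / step) * step

def restore_basis(decoded_basis, coefficients):
    """Renormalize decoded PCA components to unit norm, then apply SVD to get
    an orthogonal basis and a matching, modified set of coefficients."""
    norms = np.linalg.norm(decoded_basis, axis=0)
    unit = decoded_basis / norms                      # unit-norm components
    u, s, vt = np.linalg.svd(unit, full_matrices=False)
    # decoded_basis @ coefficients == u @ (diag(s) @ vt @ diag(norms) @ coefficients)
    new_coefficients = (s[:, None] * vt) @ (norms[:, None] * coefficients)
    return u, new_coefficients

# Hypothetical orthonormal basis (4 components of length 64) and coefficients.
rng = np.random.default_rng(3)
true_basis, _ = np.linalg.qr(rng.standard_normal((64, 4)))
coefficients = rng.standard_normal((4, 10))

decoded_basis = quantize(true_basis, step=0.05)       # coarse quantization
restored, new_coefficients = restore_basis(decoded_basis, coefficients)
```

After coarse quantization the decoded components are no longer exactly orthogonal; the SVD step restores orthogonality while leaving the synthesized data unchanged.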

In another embodiment, a variable and adaptable block size is applied with a hybrid sequential PCA method in order to produce improved results with regard to synthesis quality. The present invention bases the block size on a maximum number of PCA components and a given error tolerance for those blocks. Then, the method expands the current block size until the maximum number of PCA components is reached. In a further embodiment, the sequence of PCA components is considered as a data stream, which leads to a further reduction in the dimensionality. The method performs a post-processing step where the variable data blocks are collected for the first PCA component from each block and SVD is applied to further reduce the dimensionality. The same process is then applied to the collection of second, third, etc. components.

Various decomposition methods and mechanisms may be employed including but not limited to power factorization, generalized PCA, progressive PCA and combinations thereof. Examples are described in related U.S. patent application Ser. No. 11/396,010.

7. Sub-Band Temporal Quantization

An alternative embodiment of the present invention uses discrete cosine transform (DCT) or discrete wavelet transform (DWT) to decompose each frame into sub-band images. Principal component analysis (PCA) is then applied to each of these “sub-band” videos (images). The concept is that sub-band decomposition of a frame of video decreases the spatial variance in any one of the sub-bands as compared with the original video frame.

For video of a moving object (person), the spatial variance tends to dominate the variance modeled by PCA. Sub-band decomposition reduces the spatial variance in any one decomposition video.

For DCT, the decomposition coefficients for any one sub-band are arranged spatially into a sub-band video. For instance, the DC coefficients are taken from each block and arranged into a sub-band video that looks like a postage stamp version of the original video. This is repeated for all the other sub-bands, and the resulting sub-band videos are each processed using PCA.
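The rearrangement of block DCT coefficients into sub-band images can be sketched as follows. The 8x8 block size, the orthonormal DCT-II normalization and the function names are assumptions; the frame dimensions are assumed to be multiples of the block size.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def subband_images(frame, block=8):
    """Apply a block DCT and regroup the coefficients by sub-band.

    Returns an array of shape (block, block, H/block, W/block) in which
    result[u, v] is the sub-band image built from coefficient (u, v) of every block.
    """
    h, w = frame.shape
    d = dct_matrix(block)
    # Split the frame into blocks: shape (H/block, W/block, block, block).
    blocks = frame.reshape(h // block, block, w // block, block).transpose(0, 2, 1, 3)
    coeffs = d @ blocks @ d.T          # 2D DCT of every block
    return coeffs.transpose(2, 3, 0, 1)

# Hypothetical 16x16 frame, giving 2x2 blocks of 8x8 pels.
rng = np.random.default_rng(4)
frame = rng.standard_normal((16, 16))
subbands = subband_images(frame)
dc_image = subbands[0, 0]              # the "postage stamp" sub-band
```

With this normalization each DC coefficient equals the block mean multiplied by the block size, which is why the DC sub-band looks like a downscaled copy of the frame. Stacking `result[u, v]` over frames gives the sub-band video for that coefficient.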

For DWT, the sub-bands are already arranged in the manner described for DCT. In a non-limiting embodiment, the truncation of the PCA coefficients is varied.

Wavelet

When data is decomposed using the discrete wavelet transformation (DWT), multiple band pass data sets result at lower spatial resolutions. The transformation process can be recursively applied to the derived data until only single scalar values result. The scalar elements in the decomposed structure are typically related in a hierarchical parent/child fashion. The resulting data contains a multi resolution hierarchical structure as well as finite differences.
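The recursive decomposition can be illustrated with the Haar wavelet, chosen here only because it is the simplest DWT; the source does not name a specific wavelet. The sketch is one-dimensional and assumes the input length is a power of two.

```python
import numpy as np

def haar_step(signal):
    """One DWT level: a half-resolution approximation plus scaled finite differences."""
    pairs = signal.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return approx, detail

def haar_decompose(signal):
    """Apply the transform recursively until a single scalar remains."""
    approx = np.asarray(signal, dtype=np.float64)
    details = []
    while approx.size > 1:
        approx, detail = haar_step(approx)
        details.append(detail)          # finest level first
    return approx[0], details

def haar_reconstruct(scalar, details):
    """Invert the decomposition, coarsest level first."""
    approx = np.array([scalar])
    for detail in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + detail) / np.sqrt(2.0)
        out[1::2] = (approx - detail) / np.sqrt(2.0)
        approx = out
    return approx

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
scalar, details = haar_decompose(signal)
rebuilt = haar_reconstruct(scalar, details)
```

Each detail coefficient at one level is the parent of two coefficients at the next finer level, which gives the hierarchical parent/child structure mentioned above.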

When DWT is applied to spatial intensity fields, many of the naturally occurring images' phenomena are represented with little perceptual loss by the first or second low band pass derived data structures due to the low spatial frequency. Truncating the hierarchical structure provides a compact representation when high frequency spatial data is either not present or considered noise.

While PCA may be used to achieve accurate reconstruction with a small number of coefficients, the transform itself can be quite large. To reduce the size of this “initial” transform, an embedded zero tree (EZT) construction of a wavelet decomposition can be used to build a progressively more accurate version of the transformation matrix.

In a preferred embodiment, PCA is applied to normalized video data followed by DWT or other Wavelet transform. This results in compressed video data that retains saliency of video image objects.

Sub-Space Classification

As is well understood by those practiced in the art, discretely sampled phenomena data and derivative data can be represented as a set of data vectors corresponding to an algebraic vector space. These data vectors include, in a non-limiting way, the pels in the normalized appearance of the segmented object, the motion parameters and any structural positions of features or vertices in two or three dimensions. Each of these vectors exists in a vector space, and the analysis of the geometry of the space can be used to yield concise representations of the sampled, or parameter, vectors. Beneficial geometric conditions are typified by parameter vectors that form compact subspaces. When one or more subspaces are mixed, creating a seemingly more complex single subspace, the constituent subspaces can be difficult to discern. There are several methods of segmentation that allow for the separation of such subspaces through examining the data in a higher dimensional vector space that is created through some interaction of the original vectors (such as inner product).

8. Feature Subspace Classification

A feature subspace is constructed using a DCT decomposition of the region associated with an object. Each resulting coefficient matrix is converted into a feature vector. These feature vectors are then clustered spatially in the resulting vector space.

The clustering provides groups of image object instances that can be normalized globally and locally toward some reference object instance. These normalized object instances can then be used as an ensemble for PCA.

In one preferred embodiment, the DCT matrix coefficients are summed as the upper and lower triangles of a matrix. These sums are considered as elements of a two dimensional vector.
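The reduction of a coefficient matrix to a two dimensional feature vector can be sketched as below. The source does not say whether the diagonal belongs to either triangle; this sketch assumes it is excluded from both, and the function name is hypothetical.

```python
import numpy as np

def triangle_feature(coeff_matrix):
    """Sum the strictly upper and strictly lower triangles of a coefficient
    matrix and return the two sums as a 2D feature vector."""
    upper = np.triu(coeff_matrix, k=1).sum()    # entries above the diagonal
    lower = np.tril(coeff_matrix, k=-1).sum()   # entries below the diagonal
    return np.array([upper, lower])

# Small stand-in for a DCT coefficient matrix.
coeffs = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])
feature = triangle_feature(coeffs)   # upper = 2+3+6, lower = 4+7+8
```

For a DCT coefficient matrix, the two triangles separate predominantly horizontal from predominantly vertical frequency content, so the resulting points can be clustered in a plane.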

In one preferred embodiment, the most dense cluster is identified and the vectors most closely associated with the cluster are selected. The pels associated with the object instances corresponding to these vectors are considered most similar to each other. The selected vectors can then be removed from the subspace and a re-clustering can yield another set of related vectors corresponding to related object instances.

In a further embodiment, the image object instances associated with the identified cluster's vectors are globally normalized toward the cluster centroid. Should the resulting normalization meet the distortion requirements, then the object instance is considered to be similar to the centroid. A further embodiment allows for failing object instances to be returned to the vector space to be candidates for further clustering.

In another embodiment, clusters are refined by testing their membership against the centroids of other clustered object instances. The result is that cluster membership may change and therefore yield a refinement that allows for the clusters to yield object instance images that are most similar.

9. Ensemble Processing

The present inventive method may utilize an ensemble selection and processing. The method selects a small subset of images from the candidate training pool based on the deformation distance of the images from the key image in the pool.

In a preferred embodiment, the DCT intra cluster distance is used as the means of determining which of the candidate images will be used to represent the variance in the cluster.

A further embodiment projects images from different clusters into different PCA spaces in order to determine ensemble membership of the remaining images. The projection is preceded by a global and local normalization of the image relative to the key ensemble image or the ensemble average.

10. Object Encoding

One embodiment of the invention performs a Fourier subspace classification on an instance of a detected object to identify one or more candidate ensembles for encoding the object instance. The closest matching ensembles are then further qualified through global and local normalization of the image relative to the key ensemble image or the ensemble average. Upon identification of the ensemble for an image, the normalized image is then segmented and decomposed using the ensemble basis vectors. The resulting coefficients are the decomposed coefficients corresponding to the original object at the instance of time corresponding to the frame containing the object. These coefficients are also referred to as the appearance coefficients.

11. Sequence Reduction

The present inventive method has a means for further reducing the coding of images utilizing an interpolation of the decomposed coefficients. The temporal stream is analyzed to determine if sequences of the appearance and/or deformation parameters have differentials that are linear. If such is the case, then only the first and last parameters are sent with an indication that the intermediate parameters are to be linearly interpolated.
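A minimal sketch of this reduction follows. It assumes that "linear" means the difference between successive parameter vectors stays constant within a tolerance; the segment format (start index, end index, first vector, last vector) and the function names are hypothetical.

```python
import numpy as np

def reduce_sequence(params, tol=1e-6):
    """Split a (T, D) parameter stream into runs with constant frame-to-frame
    differences and keep only the first and last vector of each run."""
    params = np.asarray(params, dtype=np.float64)
    segments, start, n = [], 0, len(params)
    while start < n - 1:
        end = start + 1
        step = params[end] - params[start]
        # Extend the run while the next difference matches the first one.
        while end + 1 < n and np.allclose(params[end + 1] - params[end], step, atol=tol):
            end += 1
        segments.append((start, end, params[start], params[end]))
        start = end
    return segments

def expand_sequence(segments):
    """Rebuild the full stream by linear interpolation within each run."""
    pieces = []
    for start, end, first, last in segments:
        steps = np.linspace(0.0, 1.0, end - start + 1)[:, None]
        piece = first + steps * (last - first)
        # Later runs repeat the previous run's last vector, so drop their first row.
        pieces.append(piece if not pieces else piece[1:])
    return np.vstack(pieces)

# Hypothetical one-parameter stream with two linear runs.
stream = np.array([[0.0], [1.0], [2.0], [3.0], [5.0], [7.0], [9.0]])
segments = reduce_sequence(stream)
rebuilt = expand_sequence(segments)
```

In this example seven parameter vectors are described by two runs, so only three distinct vectors (at indices 0, 3 and 6) need to be sent along with the interpolation indication.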

12. Tree Ensemble

The present invention has a preferred embodiment in which the ensemble is organized into a dependency tree that is branched based on similarity of pattern vectors. The “root” of the tree is established as the key pattern of the ensemble. Additional ensemble patterns are added to the tree and become “leaves” of the tree. The additional patterns are placed as dependents to whichever tree node is most similar to the pattern. In this way the ensemble patterns are organized such that a dependency structure is created based on similarity. This structure is utilized as an alternative to “Sequence Reduction”, providing the same method with the difference that instead of interpolating a sequence of pattern vectors, a traversal of the tree is used as an alternative to the temporal ordering.
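The tree construction can be sketched as follows. The source does not fix a similarity measure or a traversal order; this sketch assumes Euclidean distance between pattern vectors and a depth-first traversal, and the function names are hypothetical.

```python
import numpy as np

def build_dependency_tree(patterns):
    """Attach each pattern to the most similar node already in the tree.

    patterns[0] is the key pattern and becomes the root. Returns a list in
    which parents[i] is the index of the parent of pattern i (-1 for the root).
    """
    parents = [-1]
    for i in range(1, len(patterns)):
        distances = [np.linalg.norm(patterns[i] - patterns[j]) for j in range(i)]
        parents.append(int(np.argmin(distances)))
    return parents

def traversal_order(parents):
    """Depth-first traversal of the tree, used in place of temporal ordering."""
    children = {i: [] for i in range(len(parents))}
    for child, parent in enumerate(parents):
        if parent >= 0:
            children[parent].append(child)
    order, stack = [], [0]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children[node]))
    return order

# Hypothetical 2D pattern vectors; index 0 is the key pattern.
patterns = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.1, 0.0], [0.0, 1.2]])
parents = build_dependency_tree(patterns)
order = traversal_order(parents)
```

Along each branch of the traversal, neighboring patterns are similar by construction, which is what makes the tree ordering a candidate substitute for temporal ordering.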

Hybrid Spatial Normalization Compression

The present invention extends the efficiency of block-based motion predicted coding schemes through the addition of segmenting the video stream into two or more “normalized” streams. These streams are then encoded separately to allow the conventional codec's translational motion assumptions to be valid. Upon decoding the normalized streams, the streams are de-normalized into their proper position and composited together to yield the original video sequence.

In one embodiment, one or more objects are detected in the video stream and the pels associated with each individual object are subsequently segmented leaving non-object pels. Next, a global spatial motion model is generated for the object and non-object pels. The global model is used to spatially normalize object and non-object pels. Such a normalization has effectively removed the non-translational motion from the video stream and has provided a set of videos whose occlusion interaction has been minimized. These are both beneficial features of the present inventive method.

The new videos of object and the non-object, having their pels spatially normalized, are provided as input to a conventional block-based compression algorithm. Upon decoding of the videos, the global motion model parameters are used to de-normalize those decoded frames, and the object pels are composited together and onto the non-object pels to yield an approximation of the original video stream.

As shown in FIG. 6, the previously detected object instances 206 and 208 for one or more objects 630 and 650 are each processed with a separate instance of a conventional video compression method 632. Additionally, the non-object 602 resulting from the segmentation 230 of the objects is also compressed using conventional video compression 632. The result of each of these separate compression encodings 632 is a separate conventional encoded stream 634 for each corresponding video stream. At some point, possibly after transmission, these intermediate encoded streams 234 can be decompressed 636 into a synthesis of the normalized non-object 610 and a multitude of objects 638 and 658. These synthesized pels can be de-normalized 640 into their de-normalized versions 622, 642 and 662 to correctly position the pels spatially relative to each other so that a compositing process 670 can combine the object and non-object pels into a synthesis of the full frame 672.

Two of the most prevalent compression techniques are the discrete cosine transform (DCT) and discrete wavelet transform (DWT). Error in the DCT transform manifests in a wide variation of video data values, and therefore, the DCT is typically used on blocks of video data in order to localize these false correlations. The artifacts from this localization often appear along the border of the blocks. In DWT, more complex artifacts occur when there is a mismatch between the basis function and certain textures, and this causes a blurring effect. To counteract the negative effects of DCT and DWT, the precision of the representation is increased to lower distortion at the cost of precious bandwidth.

In accordance with the present invention, a video image compression method (image processing method in general) is provided, which combines principal component analysis (PCA) and wavelet compression. In a preferred embodiment, parallel bases are built at both the sender and the receiver. With the present technique, the parallel basis becomes the original frames (anchor frames) used in the coding and decoding processes 632, 636. Specifically, basis information is sent to the receiver and is used to replicate the basis for additional frames. At encoder 634, PCA is applied and the dataset is reduced by applying a wavelet transform, while the basis is being transmitted. In particular, the PCA to wavelet compression process is an intermediate step that occurs while the bases are being transmitted to the receiver.

In another embodiment, switching between encoding modes is performed based on a statistical distortion metric, such as PSNR (peak signal to noise ratio), which determines whether the conventional method or the subspace method is used to encode the frames of video.
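A sketch of PSNR-driven mode switching is given below. The PSNR formula is standard; the decision rule (pick whichever reconstruction has the higher PSNR), the 8-bit peak value and the function names are assumptions made for illustration.

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal to noise ratio in decibels; infinite for identical frames."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return np.inf
    return 10.0 * np.log10(peak ** 2 / mse)

def select_mode(original, conventional_recon, subspace_recon):
    """Choose the encoding mode whose reconstruction has the higher PSNR."""
    if psnr(original, subspace_recon) >= psnr(original, conventional_recon):
        return "subspace"
    return "conventional"

# Hypothetical 4x4 frame and two reconstructions with different error levels.
original = np.zeros((4, 4))
subspace_recon = original + 1.0        # mean squared error 1
conventional_recon = original + 2.0    # mean squared error 4

mode = select_mode(original, conventional_recon, subspace_recon)
```

A practical encoder would likely weigh distortion against the bits each mode spends, but the source names only the distortion metric, so the sketch compares PSNR alone.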

In another embodiment of the invention, the encoded parameters of the appearance, global deformation (structure, motion and pose) and local deformation are interpolated to yield predictions of intermediate frames that would not otherwise have to be encoded. The interpolation method can be any of the standard interpolation methods such as linear, cubic, spline, etc.

As shown in FIG. 14, the object interpolation method can be achieved through the interpolation analysis 1408 of a series of normalized objects 1402, 1404 and 1406 as represented by appearance and deformation parameters. The analysis determines the temporal range 1410 over which an interpolating function can be applied. The range specification 1410 can then be combined with the normalized object specifications 1414 and 1420 in order to approximate and ultimately synthesize the interim normalized objects 1416 and 1418.

Integration of Hybrid Codec

In combining a conventional block-based compression algorithm and a normalization-segmentation scheme, as described in the present invention, there are several inventive methods that have resulted. Primarily, there are specialized data structures and communication protocols that are required.

The primary data structures include global spatial deformation parameters and object segmentation specification masks. The primary communication protocols are layers that include the transmission of the global spatial deformation (global structural model) parameters and object segmentation specification masks.

Global Structure, Global Motion and Local Deformation Normalization Compression

In a preferred embodiment, the foregoing PCA/wavelet encoding techniques are applied to a preprocessed video signal to form a desired compressed video signal. The preprocessing reduces complexity of the video signal in a manner that enables PCA/wavelet encoding (compression) to be applied with increased effect. The image processing system 1500 of FIG. 10 is illustrative.

In FIG. 10, source video signal 1501 is input to or otherwise received by a preprocessor 1502. The preprocessor 1502 uses bandwidth consumption to determine components of interest (salient objects) in the source video signal 1501. In particular, the preprocessor 1502 determines portions of the video signal which use disproportionate bandwidth relative to other portions of the video signal 1501. One method or segmenter 1503 for making this determination is as follows.

Segmenter 1503 analyzes an image gradient over time and/or space using temporal and/or spatial differences in derivatives of pels as described above. In coherence monitoring, parts of the video signal that correspond to each other across sequential frames of the video signal are tracked and noted. The finite differences of the derivative fields associated with these coherent signal components are integrated to produce the determined portions of the video signal which use disproportionate bandwidth relative to other portions (i.e., determines the components of interest). In a preferred embodiment, if a spatial discontinuity in one frame is found to correspond to a spatial discontinuity in a succeeding frame, then the abruptness or smoothness of the image gradient is analyzed to yield a unique correspondence (temporal coherency). Further collections of such correspondences are also employed in the same manner to uniquely attribute temporal coherency of discrete components of the video frames. For an abrupt image gradient, an edge is determined to exist. If two such edge defining spatial discontinuities exist then a corner is defined. These identified spatial discontinuities are combined with the gradient flow which produces motion vectors between corresponding pels across frames of the video data. When a motion vector is coincident with an identified spatial discontinuity, then the invention segmenter 1503 determines that a component of interest (salient object) exists.

Other segmentation techniques as described in previous sections are suitable for implementing segmenter 1503. For example, face/object detection may be used.

Returning to FIG. 10, once the preprocessor 1502 (segmenter 1503) has determined the components of interest (salient objects) or otherwise segmented the same from the source video signal 1501, a normalizer 1505 reduces the complexity of the determined components of interest. Preferably the normalizer 1505 removes variance of global motion and pose, global structure, local deformation, appearance and illumination (appearance variance) from the determined components of interest. The normalization techniques previously described herein are utilized toward this end. This results in the normalizer 1505 establishing a structural model 1507 and an appearance model 1508 of the components of interest.

The structural model 1507 may be mathematically represented as:

$\begin{matrix}{{{SM}(\sigma)} = {\sum\limits_{x,y}\lbrack {( {v_{x,y} + \Delta_{t}} ) + Z} \rbrack}} & {{Equation}\mspace{14mu} 1}\end{matrix}$

where σ is the salient object (determined component of interest) and SM( ) is the structural model of that object;

$v_{x,y}$ are the 2D mesh vertices of a piece-wise linear regularized mesh over the object σ registered over time (discussed above);

$\Delta_{t}$ are the changes in the vertices relative to each other over time t representing scaling (or local deformation), rotation and translation of the object between video frames; and

Z is global motion (i.e., movement of the whole meshing and deformation of the mesh). In some embodiments, Z represents position of the 2D mesh in space and pose of the mesh represented by three rotational parameters.

From Equation 1, applicant derives a global rigid structural model, global motion, pose and locally derives deformation of the model as discussed in FIG. 4. Rigid local deformation aspects are defined by position of each mesh vertex in space. Non-rigid local deformation is expressed in correlation of the vertices across video frames. Independent motion of the vertices is also correlated, resulting in a low (efficient) dimension motion model. Known techniques for estimating structure from motion are employed and are combined with motion estimation to determine candidate structure for the structural parts of the component of interest of the video frame over time. This results in defining the position and orientation of the salient object in space and hence provides a structural model 1507 and motion model 1506.

In one embodiment, motion estimates are constrained by deformation models, a structural model 1507 and illumination (appearance variance) model. Structure from motion techniques are used to determine changes in object pose/position from one video frame to another. An LRLS (see below) or other bilinear tracker tracks the candidate object structure over time. The tracker determines object pose/position changes (Δ's) for each frame as predictions to the 2D motion estimation.

The appearance model 1508 then represents characteristics and aspects of the salient object which are not collectively modeled by the structural model 1507 and motion model 1506. In one embodiment, the appearance model 1508 is a linear decomposition of structural changes over time and is defined by removing global motion and local deformation from the structural model 1507. Applicant takes object appearance at each video frame and using the structural model 1507 reprojects to a “normalized pose”. The “normalized pose” is also referred to as one or more “cardinal” poses. The reprojection represents a normalized version of the object and produces any variation in appearance. As the given object rotates or is spatially translated between video frames, the appearance is positioned in a single cardinal pose (i.e., the average normalized representation). The appearance model 1508 also accounts for cardinal deformation of a cardinal pose (e.g., eyes opened/closed, mouth opened/closed, etc.). Thus appearance model 1508 AM(σ) is represented by cardinal pose $P_{c}$ and cardinal deformation $\Delta_{c}$ in cardinal pose $P_{c}$:

$\begin{matrix}{{{AM}(\sigma)} = {\sum\limits_{t}( {P_{c} + {\Delta_{c}P_{c}}} )}} & {{Equation}\mspace{14mu} 2}\end{matrix}$

The pels in the appearance model 1508 are preferably biased based on their distance and angle of incidence to camera projection axis. Biasing determines the relative weight of the contribution of an individual pel to the final formulation of a model. Tracking of the candidate structure (from the structural model 1507) over time can form or enable a prediction of the motion of all pels by implication from pose, motion and deformation estimates. This is due in part to the third dimension (Z) in the structural model 1507. That third dimension allows for the 2D mesh to be tracked over more video frames combining more objects from different frames to be represented by the same appearance model 1508. Further, the third dimension allows for the qualification of the original pels relative to their orientation with the sensor array of the camera. This information is then used to determine how much any particular pel contributes to an appearance model 1508.

Lastly, object appearance is normalized from different frames based on each dimension. That is, the present invention resolves three dimensions and preferably uses multiple normalization planes to model the appearance. For example, normalizer 1505 removes variance of global motion (Z) and pose, global structure, local deformation and illumination (appearance variance) as described above.

Further, with regard to appearance variance (appearance and illumination modeling), one of the persistent challenges in image processing has been tracking objects under varying lighting conditions. In image processing, contrast normalization is a process that models the changes in the range of pixel intensity values as attributable to changes in the lighting/illumination rather than to other factors. The preferred embodiment estimates a salient object's arbitrary changes in illumination conditions under which the video was captured (i.e., modeling illumination incident on the object). This is achieved by combining principles from Lambertian Reflectance Linear Subspace (LRLS) theory with optical flow. According to the LRLS theory, when an object is fixed only allowing for illumination changes, the set of the reflectance images can be approximated by a linear combination of the first nine spherical harmonics; thus the image lies close to a 9D linear subspace in an ambient “image” vector space. In addition, the reflectance intensity I for an image pixel (x,y) can be approximated as

${{I( {x,y} )} = {\sum\limits_{{i = 0},1,{{2j} = {- i}},}{\sum\limits_{{{- i} + {1\mspace{14mu} \ldots \mspace{14mu} i} - 1},i}{l_{ij}{b_{ij}(n)}}}}},$

In accordance with aspects of the present invention, using LRLS and optical flow, expectations are computed about how lighting interacts with the object. These expectations serve to constrain the possible object motion that can explain changes in the optical flow field. When using LRLS to describe the appearance of the object using illumination modeling, it is still necessary to allow an appearance model to handle any appearance changes that may fall outside of the illumination model's predictions.

With the present technique, a succeeding video frame in a sequence of frames can be predicted and then principal component analysis (PCA) can be performed. In this way, a very generalized form of the image data can be built and then PCA can be performed on the remainder of the data.

Other mathematical representations of the appearance model 1508 and structural model 1507 are suitable as long as the complexity of the components of interest is substantially reduced from the corresponding original video signal but saliency of the components of interest is maintained.

Returning to FIG. 10, PCA/wavelet encoding (described above) is then applied to the structural model 1507 and appearance model 1508 by analyzer 1510. More generally, analyzer 1510 employs a geometric data analysis to compress (encode) the video data corresponding to the components of interest. The resulting compressed (encoded) video data is usable in the FIG. 6 image processing system described above. In particular, the models 1506, 1507, 1508 are preferably stored at the encoding and decoding sides 632, 636 of FIG. 6. From the structural model 1507 and appearance model 1508, a finite state machine is generated. The conventional coding 632 and decoding 636 can also be implemented as a conventional wavelet video coding-decoding scheme. This wavelet scheme can be employed to synthesize video data while maintaining saliency of objects/components of interest. In one embodiment, during training, for a given video data, the finite state machine linearly decomposes appearance using wavelet transform techniques and outputs a normalized (MPEG or similar standard) video compression stream. During image processing time, the finite state machine on both sides 632, 636 interpolates pel data (as described above) and produces a compressed form of the video data. In this way the invention state machine synthesizes video data while maintaining saliency of objects/components of interest.

As discussed above, PCA encoding (or other linear decomposition) is applied to the normalized pel data on both sides 632 and 636 which builds the same set of basis vectors on each side 632, 636. In a preferred embodiment, PCA/wavelet is applied on the basis function during image processing to produce the desired compressed video data. Wavelet techniques (DWT) transform the entire image and sub-image and linearly decompose the appearance model 1508 and structural model 1507, then this model is truncated gracefully to meet desired threshold goals (à la EZT or SPIHT). This enables scalable video data processing unlike systems/methods of the prior art due to the “normalized” nature of the video data.

Further, given a single pel of one frame of video data, the image processing system 1500 of the present invention is able to predict the succeeding frame (parameters thereof) due to the application of PCA/wavelet compression on the structural model 1507 and/or the appearance model 1508.

Accordingly, the present invention may be restated as a predictive model. Once the appearance model 1508 and structural model 1507 are established as described above, application of geometric data analysis techniques (e.g., sequential PCA, power factorization, generalized PCA, progressive PCA combining PCA/wavelet transform, and the like) to at least the appearance model 1508 provides encoded video data (sequence of frames) of the components of interest.
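As an illustration of applying a linear decomposition to normalized appearance data, the following minimal sketch uses plain batch PCA computed through a singular value decomposition. It is an assumption-laden example, not the sequential, generalized or progressive PCA variants named above: each row of the input is taken to be the flattened, normalized pel data of one frame, and the names pca_encode and pca_decode are hypothetical.

```python
import numpy as np


def pca_encode(frames, k):
    """Project frames onto their first k principal directions.

    frames: array of shape (n_frames, n_pels), one normalized frame per row.
    Returns the mean frame, the k basis vectors, and per-frame coefficients.
    """
    x = np.asarray(frames, dtype=float)
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are orthonormal principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    coeffs = centered @ basis.T
    return mean, basis, coeffs


def pca_decode(mean, basis, coeffs):
    """Recompose approximate frames from the truncated representation."""
    return mean + coeffs @ basis
```

When both sides hold the same mean and basis vectors, only the k coefficients per frame need to be exchanged; choosing a smaller k trades reconstruction accuracy for a more compact encoding.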

In further embodiments, the image processing system of the present invention may be represented in spherical terms instead of 3D mesh terms. Each component of interest is represented by respective ellipsoids that contain data from the linear decomposition. For a given component of interest, the minor axis of the ellipsoid defines appearance model 1508 basis vectors and the major axis defines structural model 1507 basis vectors. Other ellipsoids are suitable such as hyper ellipsoids. Implicit in this elliptical representation is motion estimation, a deformation model and an illumination model sufficient to maintain saliency of objects. As a result, an implicit representation provides for a much more compact encoding of the video data corresponding to the components of interest.

Virtual Image Sensor

“Illumination” of an object is the natural phenomenon of light falling incident on the object. The illumination changes as a function of θ, the angle of incidence, and I, the light intensity (of the reflectance). A camera (or image sensor generally) effectively samples and records the illumination of the object. The result is a photographic image (e.g., still snapshot or sequence of video frames) of the object. The pels in the sample image are attributed to a certain value of θ (angle of incidence) and a certain value of I (light reflectance intensity). For different values of θ and/or I, each pel takes on a respective different data value. For each pel in the image (or at least for salient objects in the image), the present invention models the possible pel data values for different values of θ and I. Using this model, one can determine the subject object's motion, position and pose in a succeeding video or image data frame given the change in illumination of one pel (i.e., the difference in that pel's data value between the current video or image data frame and the succeeding video or image data frame).

Accordingly, the present invention provides a virtual image sensor, preferably a different virtual sensor for different data. The virtual image sensor is built according to aspects (quality, representation limits, etc.) of the respective data. The virtual image sensor discretely isolates information from the respective image data (i.e., segments and normalizes or otherwise removes variance), and that information is sufficient to retain saliency or quality of the uncompressed (decoded) version of the data.

A preferred embodiment of the virtual image sensor 1010 of the present invention is illustrated in FIG. 11. Source image 12 data (an image data frame) is received at step 1001. In response, step 1001 applies the above described object detection, segmentation and normalization techniques of preprocessor 1502 to form a model 1507, 1508 of the salient objects (components of interest) in the image data. The model 1507, 1508 includes Lambertian modeling of how facets (pels) illuminate and the corresponding possible pel data values for different values of θ and I.
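The Lambertian relationship between a pel's data value, θ and I can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Lambert's cosine law, in which the value is proportional to I·cos θ, and it introduces a per-facet reflectance factor (called albedo here) that the specification does not name. The functions lambertian_value, value_table and incidence_from_value are hypothetical names, and angles are in radians.

```python
import numpy as np


def lambertian_value(albedo, theta, intensity):
    """Pel value under Lambert's cosine law: albedo * I * cos(theta).

    Facets turned away from the light (cos(theta) < 0) receive no light.
    """
    return albedo * intensity * np.maximum(np.cos(theta), 0.0)


def value_table(albedo, thetas, intensities):
    """Possible pel values for every combination of theta and I.

    Returns an array of shape (len(thetas), len(intensities)).
    """
    t, i = np.meshgrid(thetas, intensities, indexing="ij")
    return lambertian_value(albedo, t, i)


def incidence_from_value(value, albedo, intensity):
    """Angle of incidence implied by an observed pel value, with I known."""
    ratio = np.clip(value / (albedo * intensity), 0.0, 1.0)
    return np.arccos(ratio)
```

Under these assumptions, a change in one pel's value between two frames maps to a change in the implied angle of incidence, which is the kind of cue the model can use when estimating a change in pose.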

For a given pel in the source image 12, step 1002 analyzes the range of possible data values defined by the model 1507, 1508 and compares the current data value as produced by the source camera 11 to the model data values, especially those representing a theoretical best resolution. This step 1002 is repeated for other pels in the image data. Based on the comparisons, step 1002 determines a relationship between the source camera's 11 resolution and a theoretic super resolution as defined by the model 1507, 1508. Step 1002 represents this relationship as a function.

Step 1004 applies the resulting function of step 1002 to the source image 12 and extrapolates or otherwise synthesizes an increased resolution image 1011. Preferably step 1004 produces a super resolved image 1011 of the source image 12.

In this way the present invention provides a virtual image sensor 1010. It is noted that the compressed (parameterized version of) data in the model 1507, 1508 enables such processing (extrapolations and synthesizing).

FIG. 2 a illustrates a computer network or similar digital processing environment in which the present invention may be implemented.

Client computer(s)/devices 50 and server computer(s) 60 provide processing, storage, and input/output devices executing application programs and the like. Client computer(s)/devices 50 can also be linked through communications network 70 to other computing devices, including other client devices/processes 50 and server computer(s) 60. Communications network 70 can be part of a remote access network, a global network (e.g., the Internet), a worldwide collection of computers, Local area or Wide area networks, and gateways that currently use respective protocols (TCP/IP, Bluetooth, etc.) to communicate with one another. Other electronic device/computer network architectures are suitable.

FIG. 2 b is a diagram of the internal structure of a computer (e.g., client processor/device 50 or server computers 60) in the computer system of FIG. 2 a. Each computer 50, 60 contains system bus 79, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. Bus 79 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, network ports, etc.) that enables the transfer of information between the elements. Attached to system bus 79 is I/O device interface 82 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, speakers, etc.) to the computer 50, 60. Network interface 86 allows the computer to connect to various other devices attached to a network (e.g., network 70 of FIG. 2 a). Memory 90 provides volatile storage for computer software instructions 92 and data 94 used to implement an embodiment of the present invention (e.g., linear decomposition, spatial segmentation, spatial normalization and other processing of FIG. 2 and other figures detailed above). Disk storage 95 provides non-volatile storage for computer software instructions 92 and data 94 used to implement an embodiment of the present invention. Central processor unit 84 is also attached to system bus 79 and provides for the execution of computer instructions.

In one embodiment, the processor routines 92 and data 94 are a computer program product (generally referenced 92), including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the invention system. Computer program product 92 can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection. In other embodiments, the invention programs are a computer program propagated signal product 107 embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet, or other network(s)). Such carrier medium or signals provide at least a portion of the software instructions for the present invention routines/program 92.

In alternate embodiments, the propagated signal is an analog carrier wave or digital signal carried on the propagated medium. For example, the propagated signal may be a digitized signal propagated over a global network (e.g., the Internet), a telecommunications network, or other network. In one embodiment, the propagated signal is a signal that is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer. In another embodiment, the computer readable medium of computer program product 92 is a propagation medium that the computer system 50 may receive and read, such as by receiving the propagation medium and identifying a propagated signal embodied in the propagation medium, as described above for computer program propagated signal product.

Generally speaking, the term “carrier medium” or transient carrier encompasses the foregoing transient signals, propagated signals, propagated medium, storage medium and the like.

While this invention has been particularly shown and described with references to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

What is claimed is:
 1. A data processing system for generating an encoded form of video signal data from a plurality of video frames, the system comprising: (a) an object detector configured to detect at least one object in two or more given video frames based on bandwidth consumption; (b) an object tracker, in communication with the object detector, configured to track the at least one object through the two or more video frames; (c) a segmenter, in communication with the object detector and the object tracker, configured to segment pel data corresponding to the at least one object from other pel data in the two or more video frames so as to generate a first intermediate form of the data, the segmenting utilizing a spatial segmentation of the pel data, the first intermediate form of the data including the segmented pel data of the at least one object and the other pel data in the two or more video frames; and (d) a normalizer, in communication with the segmenter, configured to normalize the first intermediate form of the data to generate a second intermediate form of the data by: identifying corresponding elements of the at least one object in the given two or more video frames; analyzing the corresponding elements to generate relationships between the corresponding elements; generating correspondence models by using the generated relationships between the corresponding elements; integrating the relationships between the corresponding elements into a model of global motion; and re-sampling pel data associated with the at least one object in the two or more video frames by utilizing the correspondence models and model of global motion to generate a structural model or an appearance model representing a second intermediate form of the data.
 2. The data processing system of claim 1 comprising an encoder configured to encode the second intermediate form of the data by: decomposing the re-sampled pel data into an encoded representation, the encoded representation representing a third intermediate form of the data; truncating zero or more bytes of the encoded representation; and recomposing the re-sampled pel data from the encoded representation; wherein each of the decomposing and the recomposing uses Principal Component Analysis.
 3. The data processing system of claim 1 wherein the normalizer is configured to factor the correspondence models into local deformation models by: generating a two dimensional mesh overlying pels corresponding to the at least one object, the mesh being based on a regular grid of vertices and edges, and generating a model of local motion from the relationships between the corresponding elements, the relationships comprising vertex displacements based on finite differences generated from a block-based motion estimation between two or more of the video frames.
 4. The data processing system of claim 3 wherein the vertices correspond to discrete image features, the normalizer configured to identify significant image features corresponding to the object by using an analysis of the image intensity gradient.
 5. The data processing system of claim 1 further comprising an encoder configured to: (e) restore spatial positions of the re-sampled pel data by utilizing the correspondence models, thereby generating restored pels corresponding to the at least one object; and (f) recombine the restored pels together with the other pel data in the first intermediate form of the data to create an original video frame; and wherein the second intermediate form of the data is sufficiently reduced in complexity to enabling data compression by linear decomposition in an improved manner while maintaining saliency of the at least one object.
 6. The data processing system of claim 1 wherein the object detector and the object tracker use a face detector to facilitate the respective detection and the tracking of the at least one object.
 7. The data processing system of claim 1 wherein the normalizer configured to analyze the corresponding elements further includes the normalizer configured to use an appearance-based motion estimator configured to facilitate appearance-based motion estimation of the at least one object between two or more of the video frames.
 8. A computer-implemented method executing on one or more processors that generates an encoded form of video signal data from a plurality of video frames, the method comprising: (a) based on bandwidth consumption, detecting at least one object in two or more given video frames; (b) tracking the at least one object through the two or more video frames; (c) segmenting pel data corresponding to the at least one object from other pel data in the two or more video frames so as to generate a first intermediate form of the data, the segmenting utilizing a spatial segmentation of the pel data, the first intermediate form of the data including the segmented pel data of the at least one object and the other pel data in the two or more video frames; (d) normalizing the first intermediate form of the data to generate a second intermediate form of the data by performing one or more of the following steps: identifying corresponding elements of the at least one object in the given two or more video frames; analyzing the corresponding elements to generate relationships between the corresponding elements; generating correspondence models by using the generated relationships between the corresponding elements; integrating the relationships between the corresponding elements into a model of global motion; and re-sampling pel data associated with the at least one object in the two or more video frames by utilizing the correspondence models and model of global motion to generate a structural model or an appearance model representing a second intermediate form of the data.
 9. The method of claim 8 comprising encoding the second intermediate form of the data, the encoding comprising: decomposing the re-sampled pel data into an encoded representation, the encoded representation representing a third intermediate form of the data; truncating zero or more bytes of the encoded representation; and recomposing the re-sampled pel data from the encoded representation; wherein each of the decomposing and the recomposing uses Principal Component Analysis.
 10. The method of claim 8 comprising a method of factoring the correspondence models into local deformation models, the method comprising: defining a two dimensional mesh overlying pels corresponding to the at least one object, the mesh being based on a regular grid of vertices and edges, and generating a model of local motion from the relationships between the corresponding elements, the relationships comprising vertex displacements based on finite differences generated from a block-based motion estimation between two or more of the video frames.
 11. The method of claim 10 wherein the vertices correspond to discrete image features, the method comprising identifying significant image features corresponding to the object by using an analysis of the image intensity gradient.
 12. The method of claim 8 further comprising the steps of: (e) restoring spatial positions of the re-sampled pel data by utilizing the correspondence models, thereby generating restored pels corresponding to the at least one object; and (f) recombining the restored pels together with the other pel data in the first intermediate form of the data to create an original video frame; and wherein the second intermediate form of the data is sufficiently reduced in complexity to enabling data compression by linear decomposition in an improved manner while maintaining saliency of the at least one object.
 13. The method of claim 8 wherein the detecting and tracking comprise using a face detector.
 14. The method of claim 8 wherein analyzing the corresponding elements comprises using an appearance-based motion estimation between two or more of the video frames.
 15. A computer apparatus configured to generate an encoded form of video signal data from a plurality of video frames, the apparatus comprising: one or more computer processors configured to execute: (a) an object detector configured to detect at least one object in two or more given video frames based on bandwidth consumption; (b) an object tracker, in communication with the object detector, configured to track the at least one object through the two or more video frames; (c) a segmenter, in communication with the object detector and the object tracker, configured to segment pel data corresponding to the at least one object from other pel data in the two or more video frames so as to generate a first intermediate form of the data, the segmenting utilizing a spatial segmentation of the pel data, the first intermediate form of the data including the segmented pel data of the at least one object and the other pel data in the two or more video frames; and (d) a normalizer, in communication with the segmenter, configured to normalize the first intermediate form of the data to generate a second intermediate form of the data by: identifying corresponding elements of the at least one object in the given two or more video frames; analyzing the corresponding elements to generate relationships between the corresponding elements; generating correspondence models by using the generated relationships between the corresponding elements; integrating the relationships between the corresponding elements into a model of global motion; and re-sampling pel data associated with the at least one object in the two or more video frames by utilizing the correspondence models and model of global motion to generate a structural model or an appearance model representing a second intermediate form of the data.